I. REAL PARTY IN INTEREST

The real party in interest is EMC Corporation, by virtue of assignments recorded at Reel 
014554 Frame 0097 and 014880 Frame 0711. 

II. RELATED APPEALS AND INTERFERENCES

There are no related appeals or interferences. 

III. STATUS OF THE CLAIMS

Claims 1-73 have been presented for examination. 
Claims 29-31 and 55-57 have been canceled. 

Claims 1-28, 32-54 and 58-73 have been finally rejected. 
Claims 1-28, 32-54, 58-65, 67-71, and 73 are being appealed. 

IV. STATUS OF AMENDMENTS

A Reply to Final Official Action was filed on April 13, 2010. This reply did not request
any amendment to the specification or claims. An Advisory Action dated April 22, 2010
indicated that this request for reconsideration did not place the application in condition for
allowance.

V. SUMMARY OF CLAIMED SUBJECT MATTER

The invention of appellants' independent claim 1 is a method of operating a network file 
server (21 in appellants' FIG. 1; appellants' specification, page 11, lines 4-8) for providing 
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to a file (FIG. 14; page 29, line 22 to page 30, line 3). (Appellants' specification, page 3, 
lines 19-21.) The method includes the network file server responding to a concurrent write 
request from a client by obtaining a lock for the file (step 101 in FIG. 11), and then preallocating
a metadata block for the file (step 102 in FIG. 11), and then releasing the lock for the file (step
103 in FIG. 11); and then asynchronously writing to the file (step 104 in FIG. 11); and then
obtaining the lock for the file (step 106 in FIG. 11); and then committing the metadata block to
the file (step 107 in FIG. 11); and then releasing the lock for the file (step 108 in FIG. 11).
(Appellants' specification, page 3 line 21 to page 4 line 2; page 27 lines 3-23.) Appellants' FIG.
11 is reproduced below.

The invention of appellants' independent claim 13 is a method of operating a network file 
server (21 in appellants' FIG. 1; appellants' specification, page 11, lines 4-8) for providing 
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17- 
19) to a file (FIG. 14; page 29, line 22 to page 30, line 3). (Appellants' specification, page 4, 
lines 3-4.) The method includes the network file server responding to a concurrent write request 
from a client by preallocating a block for the file (step 102 in FIG. 11); and then asynchronously
writing to the file (step 104 in FIG. 11); and then committing the block to the file (step 107 in
FIG. 11). (Appellants' specification, page 4, lines 4-7; page 27 lines 4-12 and 14-20.) The
asynchronous writing to the file includes a partial write to a new block (123 in FIG. 13) that has 
been copied at least in part from an original block (121 in FIG. 13) of the file. (Appellants' 
specification, page 4, lines 7-9; page 28 line 22 to page 29 line 7.) The method includes 
checking a partial block conflict queue (73 in FIG. 4; page 14 lines 19-22; page 15 lines 20-22) 
for a conflict with a concurrent write to the new block (step 151 in FIG. 17; page 34 lines 7-10), 
and upon finding an indication of a conflict with a concurrent write to the new block, waiting 
until resolution of the conflict with the concurrent write to the new block (step 156 in FIG. 17; 
page 34 line 24 to page 35 line 2), and then performing the partial write to the new block (step 
157 in FIG. 17; page 35 lines 2-5). (Appellants' specification, page 4, lines 7-13.) Appellants' 
FIGS. 4, 13 and 17 are reproduced below. 

The invention of appellants' independent claim 15 is a method of operating a network file
server (21 in appellants' FIG. 1; appellants' specification, page 11, lines 4-8) for providing
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to a file (FIG. 14; page 29, line 22 to page 30, line 3). (Appellants' specification, page 4, 
lines 14-15.) The method includes the network file server responding to a concurrent write 
request from a client by preallocating a metadata block for the file (step 102 in FIG. 11), and 
then asynchronously writing to the file (step 104 in FIG. 11), and then committing the metadata 
block to the file (step 107 in FIG. 11). (Appellants' specification, page 4, lines 15-18; page 27
lines 4-12 and 14-20.) The method further includes gathering together preallocated metadata 
blocks for a plurality of client write requests to the file (step 117 in FIG. 12), and committing 
together the preallocated metadata blocks for the plurality of client write requests to the file by 
obtaining a lock for the file (step 106 in FIG. 11), committing the gathered preallocated metadata
blocks for the plurality of client write requests to the file (step 107 in FIG. 11; step 118 in FIG.
12), and then releasing the lock for the file (step 108 in FIG. 11). (Appellants' specification,
page 4, lines 18-23; page 27 lines 14-23; page 28, lines 9-21.) Appellants' FIG. 12 is 
reproduced below. 

The invention of appellants' independent claim 25 is a method of operating a network file
server (21 in appellants' FIG. 1; appellants' specification, page 11, lines 4-8) for providing
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent read and write access (page 15
lines 17-19) to a file (FIG. 14; page 29, line 22 to page 30, line 3). The method includes the
network file server responding to a concurrent write request from a client by preallocating a
metadata block for the file (step 102 in FIG. 11), and then asynchronously writing to the file
(step 104 in FIG. 11), and then committing the metadata block to the file (step 107 in FIG. 11).
(Appellants' specification, page 27 lines 4-12 and 14-20.) The network file server includes disk 
storage (29 in FIGS. 1 and 2; page 11 lines 3-6; page 12 lines 19-21) containing a file system (54
in FIG. 2 and FIG. 3; page 12 lines 6-9 and 19-21; page 13 lines 11-16), and a file system cache 
(51 in FIG. 2 and FIG. 3; page 12 lines 21-23; page 13 lines 8-16) storing data of blocks (134,
135, 138, 139 in FIG. 14; page 29, line 22 to page 30, line 3) of the file. The method includes
the network file server responding to concurrent write requests by writing new data for specified 
blocks of the file to the disk storage without writing the new data for the specified blocks of the 
file to the file system cache, and invalidating the specified blocks of the file in the file system 
cache. (Steps 515 and 516 in FIG. 10; page 23 lines 19-23 and page 24 lines 7-15.) The method
further includes the network file server responding to read requests for file blocks not found in
the file system cache by reading the file blocks from the file system in disk storage and then 
checking whether the file blocks have become stale due to concurrent writes to the file blocks, 
and writing to the file system cache a file block that has not become stale, and not writing to the 
file system cache a file block that has become stale. (Steps 92, 97, 513, 98 in FIG. 10; page 23 
lines 5-17; page 24 lines 16-22.) Appellants' FIG. 10 is reproduced below.

The invention of appellants' independent claim 32 is a method of operating a network file 
server (21 in appellants' FIG. 1; appellants' specification, page 11, lines 4-8) for providing 
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to a file (FIG. 14; page 29, line 22 to page 30, line 3). (Appellants' specification, page 5, 
lines 1-2.) The method includes the network file server responding to a concurrent write request 
from a client by executing a write thread (FIG. 11). (Appellants' specification, page 4, lines 3-
4.) Execution of the write thread includes obtaining an allocation mutex for the file (step 101 in
FIG. 11), and then preallocating new metadata blocks that need to be allocated for writing to the 
file (step 102 in FIG. 11), and then releasing the allocation mutex for the file (step 103 in FIG. 
11); and then issuing asynchronous write requests for writing to the file (step 104 in FIG. 11),
waiting for callbacks indicating completion of the asynchronous write requests (step 105 in FIG. 
11), and then obtaining the allocation mutex for the file (step 106 in FIG. 11), and then 
committing the preallocated metadata blocks (step 107 in FIG. 11), and then releasing the 
allocation mutex for the file (step 108 in FIG. 11). (Appellants' specification, page 4, lines 4-10;
page 27 lines 3-23.) 

The invention of appellants' independent claim 33 is a network file server (21 in 
appellants' FIG. 1; appellants' specification, page 11, lines 4-8). (Appellants' specification, page 
5, line 11.) The network file server includes storage (29 in FIGS. 1 and 2; page 11 lines 3-6;
page 12 lines 19-21) for storing a file (FIG. 14; page 29, line 22 to page 30, line 3), and at least
one processor (26, 27, 28 in FIG. 1; page 11 lines 4-6) coupled to the storage for providing
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to the file. (Appellants' specification, page 5, lines 12-13.) The network file server is
programmed for responding to a concurrent write request from a client by obtaining a lock for
the file (step 101 in FIG. 11), and then preallocating a metadata block for the file (step 102 in
FIG. 11), and then releasing the lock for the file (step 103 in FIG. 11), and then asynchronously
writing to the file (step 104 in FIG. 11), and then obtaining the lock for the file (step 106 in FIG. 
11), and then committing the metadata block to the file (step 107 in FIG. 11), and then releasing 
the lock for the file (step 108 in FIG. 11). (Appellants' specification, page 5, lines 13-18; page 
27 lines 3-23.) 

The invention of appellants' independent claim 47 is a network file server (21 in
appellants' FIG. 1; appellants' specification, page 11, lines 4-8). (Appellants' specification, page
5, line 19.) The network file server includes storage (29 in FIGS. 1 and 2; page 11 lines 3-6;
page 12 lines 19-21) for storing a file (FIG. 14; page 29, line 22 to page 30, line 3), and at least
one processor (26, 27, 28 in FIG. 1; page 11 lines 4-6) coupled to the storage for providing
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to the file. (Appellants' specification, page 5, lines 20-21.) The network file server is
programmed for responding to a concurrent write request from a client by preallocating a block
for the file (step 102 in FIG. 11); and then asynchronously writing to the file (step 104 in FIG.
11); and then committing the block to the file (step 107 in FIG. 11). (Appellants' specification, 
page 5, line 21 to page 6, line 1; page 27 lines 4-12 and 14-20.) The network file server 
includes a partial block conflict queue (73 in FIG. 4; page 14 lines 19-22; page 15 lines 20-22) 
for indicating a concurrent write to a new block that is being copied at least in part from an 
original block of the file. (Appellants' specification, page 6, lines 1-3.) The network file server 
is programmed for responding to a client request for a partial write to the new block by checking 
the partial block conflict queue for a conflict (step 151 in FIG. 17; page 34 lines 7-10), and upon 
finding an indication of a conflict, waiting until resolution of the conflict with the concurrent 
write to the new block of the file (step 156 in FIG. 17; page 34 line 24 to page 35 line 2), and 
then performing the partial write to the new block of the file (step 157 in FIG. 17; page 35 lines 
2-5). (Appellants' specification, page 6, lines 3-7.) 



Serial No.: 10/668.467 
Reinstated Appeal Brief 



The invention of appellants' independent claim 49 is a network file server (21 in 
appellants' FIG. 1; appellants' specification, page 11, lines 4-8). (Appellants' specification, page 
6, line 8.) The network file server includes storage (29 in FIGS. 1 and 2; page 11 lines 3-6; page
12 lines 19-21) for storing a file (FIG. 14; page 29, line 22 to page 30, line 3), and at least one
processor (26, 27, 28 in FIG. 1; page 11 lines 4-6) coupled to the storage for providing clients
(23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-19) to the
file. (Appellants' specification, page 6, lines 9-10.) The network file server is programmed for
responding to a concurrent write request from a client by preallocating a metadata block for the
file (step 102 in FIG. 11), and then asynchronously writing to the file (step 104 in FIG. 11), and
then committing the metadata block to the file (step 107 in FIG. 11). (Appellants' specification, 
page 6, lines 10-13; page 27 lines 4-12 and 14-20.) The network file server is programmed for 
gathering together preallocated metadata blocks for a plurality of client write requests to the file 
(step 117 in FIG. 12), and committing together the preallocated metadata blocks for the plurality
of client write requests to the file by obtaining a lock for the file (step 106 in FIG. 11),
committing the gathered preallocated metadata blocks for the plurality of client write requests to
the file (step 107 in FIG. 11; step 118 in FIG. 12), and then releasing the lock for the file (step
108 in FIG. 11). (Appellants' specification, page 6, lines 13-18; page 27 lines 14-23; page 28,
lines 9-21.) 

The invention of appellants' independent claim 51 is a network file server (21 in appellants'
FIG. 1; appellants' specification, page 11, lines 4-8). The network file server includes disk
storage (29 in FIGS. 1 and 2; page 11 lines 3-6; page 12 lines 19-21) containing a file system (54
in FIG. 2 and FIG. 3), and a file system cache (51 in FIG. 2 and FIG. 3; page 12 lines 21-23; 
page 13 lines 8-16) storing data of blocks of a file (FIG. 14; page 29, line 22 to page 30, line 3) 
in the file system. (Appellants' specification, page 6, lines 20-22.) The network file server is 
programmed for responding to a concurrent write request from a client (23, 24, 25 in FIG. 1; 
page 11 lines 3-4) by preallocating a metadata block for the file (step 102 in FIG. 11); and then
asynchronously writing to the file (step 104 in FIG. 11), and then committing the metadata block
to the file (step 107 in FIG. 11). (Appellants' specification, page 27 lines 4-12 and 14-20.) The
network file server is further programmed for responding to concurrent write requests by writing
new data for specified blocks of the file to the disk storage without writing the new data for the 
specified blocks of the file to the file system cache, and invalidating the specified blocks of the 
file in the file system cache. (Steps 515 and 516 in FIG. 10; page 23 lines 19-23 and page 24 
lines 7-15.) The network file server is programmed for responding to concurrent read requests 
for file blocks not found in the file system cache by reading the file blocks from the file system 
in disk storage and then checking whether the file blocks have become stale due to concurrent 
writes to the file blocks, and writing to the file system cache a file block that has not become 
stale, and not writing to the file system cache a file block that has become stale. (Steps 92, 97, 
513, 98 in FIG. 10; page 23 lines 5-17; page 24 lines 16-22.)

The invention of appellants' independent claim 58 is a network file server (21 in 
appellants' FIG. 1; appellants' specification, page 11, lines 4-8). (Appellants' specification, page 
6, lines 19-20.) The network file server includes storage (29 in FIGS. 1 and 2; page 11 lines 3-6;
page 12 lines 19-21) for storing a file (FIG. 14; page 29, line 22 to page 30, line 3), and at least
one processor (26, 27, 28 in FIG. 1; page 11 lines 4-6) coupled to the storage for providing
clients (23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-
19) to the file. (Appellants' specification, page 6, lines 20-22.) The network file server is
programmed with a write thread (FIG. 11) for responding to a concurrent write request from a
client by obtaining an allocation mutex for the file (step 101 in FIG. 11), and then preallocating
new metadata blocks that need to be allocated for writing to the file (step 102 in FIG. 11), and
then releasing the allocation mutex for the file (step 103 in FIG. 11), and then issuing
asynchronous write requests for writing to the file (step 104 in FIG. 11), waiting for callbacks
indicating completion of the asynchronous write requests (step 105 in FIG. 11), and then
obtaining the allocation mutex for the file (step 106 in FIG. 11); and then committing the
preallocated metadata blocks (step 107 in FIG. 11), and then releasing the allocation mutex for
the file (step 108 in FIG. 11). (Appellants' specification, page 6, line 22 to page 7, line 6; page
27 lines 3-23.) 

The invention of appellants' independent claim 61 is a network file server (21 in 
appellants' FIG. 1; appellants' specification, page 11, lines 4-8). (Appellants' specification, page
7, line 7.) The network file server includes storage (29 in FIGS. 1 and 2; page 11 lines 3-6; page
12 lines 19-21) for storing a file (FIG. 14; page 29, line 22 to page 30, line 3), and at least one
processor (26, 27, 28 in FIG. 1; page 11 lines 4-6) coupled to the storage for providing clients
(23, 24, 25 in FIG. 1; page 11 lines 3-4) with concurrent write access (page 15 lines 17-19) to the
file. (Appellants' specification, page 7, lines 8-9.) The network file server is programmed for
responding to a concurrent write request from a client by preallocating a block for writing to the
file (step 102 in FIG. 11), asynchronously writing to the file (step 104 in FIG. 11), and then
committing the preallocated block (step 107 in FIG. 11). (Appellants' specification, page 7, lines
9-12; page 27 lines 4-12 and 14-20.) The network file server also includes an uncached write
interface (63 in FIG. 3), a file system cache (51 in FIG. 2 and FIG. 3), and a cached read-write
interface (61 in FIG. 3). (Appellants' specification, page 7, lines 12-13; page 13 line 8 to page
interface (61 in FIG. 3). (Appellants' specification, page 7, lines 12-13; page 13 line 8 to page 
14 line 17.) The uncached write interface bypasses the file system cache for sector-aligned write 
operations (FIG. 3; step 85 in FIG. 5), and the network file server is programmed to invalidate 
cache blocks in the file system cache including sectors being written to by the uncached write 
interface (FIG. 6, steps 88 and 89). (Appellants' specification, page 7, lines 13-16; page 13 lines 
17-22; page 17 lines 12-15; page 18 lines 8-14.) Appellants' FIGS. 3, 5, and 6 are reproduced 
below. 

It is respectfully submitted that none of appellants' claims contain any "means plus
function" or "step plus function" as permitted by 35 U.S.C. 112, sixth paragraph.

In the inventions of appellants' dependent claims 2 and 34, the file (FIG. 14) further includes
a hierarchy of blocks including an inode block (131 in FIG. 14) of metadata, data blocks (132,
134, 135, 138, 139 in FIG. 14) of file data, and indirect blocks of metadata (133, 136, 137 in
FIG. 14), and the metadata block for the file is an indirect block of metadata.
(Appellants' specification, page 14 lines 1-5, page 15 lines 6-8 and 17-18, page 24 lines 8-9, 
page 27 lines 4-6, page 29 line 22 to page 30 line 3). 

The inventions of appellants' dependent claims 3 and 35 further include copying data
from an original indirect block of the file (121 in FIG. 13; 136 in FIG. 14 and FIG. 15) to the 
metadata block for the file (123 in FIG. 13; 141 in FIG. 16), the original indirect block of the file 
having been shared between the file and a read-only version of the file. (Appellants' 
specification, page 15 lines 8-11; page 22 line 22 to page 31 line 7; page 28 line 22 to page 29
line 7.) 

The inventions of appellants' dependent claims 4 and 36 further include concurrent 
writing for more than one client to the metadata block for the file. (Appellants' specification,
page 15, lines 17-22.)

The inventions of appellants' dependent claims 5 and 37 further include asynchronous 
writing to the file including a partial write to a new block (123 in FIG. 13) that has been copied 
at least in part from an original block (121 in FIG. 13) of the file, and wherein the method further 
includes checking a partial block conflict queue (73 in FIG. 4; page 14 lines 19-22; page 15 lines 
17-22) for a conflict with a concurrent write to the new block (step 151 in FIG. 17; page 34 lines 
7-10), and upon failing to find an indication of a conflict with a concurrent write to the new 
block, preallocating the new block (step 102 in FIG. 11), copying at least a portion of the original
block of the file to the new block (step 153 in FIG. 17), and performing the partial write to the 
new block (step 154 in FIG. 17). (Appellants' specification, page 27 lines 4-6, page 28 line 22 to 
page 29 line 7, page 34 line 7 to page 35 line 5.) 

The inventions of appellants' dependent claims 6 and 38 are similar to claim 13 to the 
extent that claims 6 and 38 further define that the asynchronous writing to the file includes a 
partial write to a new block (123 in FIG. 13) that has been copied at least in part from an original 
block (121 in FIG. 13) of the file, and wherein the method further includes checking a partial 
block conflict queue (73 in FIG. 4; page 14 lines 19-22; page 15 lines 17-22) for a conflict with a 
concurrent write to the new block (step 151 in FIG. 17; page 34 lines 7-10), and upon finding an 
indication of a conflict with a concurrent write to the new block, waiting until resolution of the
conflict with the concurrent write to the new block (step 156 in FIG. 17; page 34 line 24 to page 
35 line 2), and then performing the partial write to the new block (step 157 in FIG. 17; page 35 
lines 2-5). 
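
For illustration only, and not as a characterization of the claims or of appellants' actual implementation, the conflict check described above might be sketched in Python as follows. The class name, the function name, and the in-memory dictionary standing in for file blocks are all hypothetical; the sketch assumes that a conflict is resolved when the earlier writer finishes with the new block.

```python
import threading


class PartialBlockConflictQueue:
    """Hypothetical queue of new blocks that have a partial write in flight."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = set()

    def begin(self, block_no):
        """Check for a conflict; wait for it to resolve; return True if one was found."""
        with self._cond:
            conflict = block_no in self._in_flight
            while block_no in self._in_flight:
                self._cond.wait()          # wait until the conflicting write finishes
            self._in_flight.add(block_no)
            return conflict

    def end(self, block_no):
        with self._cond:
            self._in_flight.discard(block_no)
            self._cond.notify_all()        # wake any writers waiting on this block


def partial_write(queue, blocks, original_no, new_no, offset, payload):
    conflict = queue.begin(new_no)
    try:
        if new_no not in blocks:
            # No earlier writer created the new block: copy the original block first.
            blocks[new_no] = bytearray(blocks[original_no])
        # Perform the partial write to the new block.
        blocks[new_no][offset:offset + len(payload)] = payload
    finally:
        queue.end(new_no)
    return conflict
```

In this sketch the first partial write copies the original block into the new block, and a later partial write to the same new block modifies the copy rather than copying the original again.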

The inventions of appellants' dependent claims 11 and 45 are similar to claim 15 to the
extent that claims 11 and 45 further define gathering together preallocated metadata blocks for a
plurality of client write requests to the file (step 117 in FIG. 12), and committing together the
preallocated metadata blocks for the plurality of client write requests to the file by obtaining the
lock for the file (step 106 in FIG. 11), committing the gathered preallocated metadata blocks for
the plurality of client write requests to the file (step 107 in FIG. 11; step 118 in FIG. 12), and
then releasing the lock for the file (step 108 in FIG. 11). (Appellants' specification, page 4, lines 
18-23; page 27 lines 14-23; page 28, lines 9-21.) 

The inventions of appellants' dependent claims 12, 16, 46, and 50 further include 
checking whether a previous commit is in progress (step 111 in FIG. 12) after asynchronously
writing to the file (steps 103 and 104 in FIG. 11) and before obtaining the lock for the file for
committing the metadata block to the file (step 112 or step 118 in FIG. 12), and upon finding that
a previous commit is in progress, placing a request for committing the metadata block to the file
on a staging queue (76 in FIG. 4) for the file (step 117 in FIG. 12). (Appellants' specification,
page 28 lines 1-4 and 9-21.) 
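
For illustration only, the staging of commit requests described above might be sketched in Python as follows. The class and attribute names are hypothetical, and the sketch assumes that the thread performing a commit also drains whatever requests were staged while it held the lock; it is not offered as appellants' actual implementation.

```python
import threading


class CommitStager:
    """Hypothetical staging queue for committing preallocated metadata blocks together."""

    def __init__(self):
        self._state = threading.Lock()      # protects the flag and the staging queue
        self._file_lock = threading.Lock()  # stands in for the lock for the file
        self._commit_in_progress = False
        self._staged = []
        self.committed_batches = []         # each entry is one group commit

    def commit(self, meta_block):
        with self._state:
            if self._commit_in_progress:
                # A previous commit is in progress: stage the request and return.
                self._staged.append(meta_block)
                return False
            self._commit_in_progress = True
        batch = [meta_block]
        while batch:
            with self._file_lock:
                # Commit the gathered blocks together under one lock acquisition.
                self.committed_batches.append(batch)
            with self._state:
                # Gather any requests staged while the commit was in progress.
                batch, self._staged = self._staged, []
                if not batch:
                    self._commit_in_progress = False
        return True
```

The design choice shown here is that staged requests from several writers share a single acquisition of the file lock, rather than each writer acquiring it separately.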

The invention of appellants' dependent claim 17 is similar to claim 25 to the extent that
claim 17 further defines the network file server responding to concurrent write requests by 
writing new data for specified blocks of the file to the disk storage without writing the new data 
for the specified blocks of the file to the file system cache, and invalidating the specified blocks 
of the file in the file system cache. (Steps 515 and 516 in FIG. 10; page 23 lines 19-23 and page 
24 lines 7-15.) 

The invention of appellants' dependent claim 18 is similar to claim 25 to the extent that 
claim 18 further defines the network file server responding to read requests for file blocks not
found in the file system cache by reading the file blocks from the file system in disk storage and
then checking whether the file blocks have become stale due to concurrent writes to the file 
blocks, and writing to the file system cache a file block that has not become stale, and not writing 
to the file system cache a file block that has become stale. (Steps 92, 97, 513, 98 in FIG. 10; 
page 23 lines 5-17; page 24 lines 16-22.)

The inventions of appellants' dependent claims 19, 26 and 52 further include the network
file server checking a read-in-progress flag for a file block (step 510 in FIG. 10) upon finding
that the file block is not in the file system cache (step 92 in FIG. 10), and upon finding that the
read-in-progress flag indicates that a prior read of the file block is in progress from the file
system in the disk storage, waiting for completion of the prior read of the file block from the file 
system in the disk storage (step 511 in FIG. 10), and then again checking whether the file block 
is in the file system cache (step 92 in FIG. 10). (Appellants' specification, page 22 line 19 to 
page 23 line 9.) 

The inventions of appellants' dependent claims 20, 27 and 53 further include the
network file server setting a read-in-progress flag for a file block (step 512 in FIG. 10) upon 
finding that the file block is not in the file system cache (step 92 in FIG. 10) and then beginning 
to read the file block from the file system in disk storage (step 512 in FIG. 10), clearing the read- 
in-progress flag upon writing to the file block on disk (steps 515 and 516 in FIG. 10), and 
inspecting the read-in-progress flag (step 513 in FIG. 10) to determine whether the file block has 
become stale due to a concurrent write to the file block. (Appellants' specification, page 23 line 10
to page 24 line 6.) 
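
For illustration only, the use of the read-in-progress flag to detect a stale block might be sketched in Python as follows. The class is a hypothetical single-threaded model; the `concurrent_write` parameter is an assumption introduced solely to simulate a write landing between the disk read and the cache update, and is not part of appellants' disclosure.

```python
class CachedFileSystem:
    """Hypothetical single-threaded model of the uncached write and stale-read check."""

    def __init__(self, disk):
        self.disk = dict(disk)          # block number -> data in disk storage
        self.cache = {}                 # stands in for the file system cache
        self.read_in_progress = set()   # blocks whose read-in-progress flag is set

    def uncached_write(self, block_no, data):
        self.disk[block_no] = data               # write to disk without writing the cache
        self.cache.pop(block_no, None)           # invalidate the block in the cache
        self.read_in_progress.discard(block_no)  # clear the flag: in-flight reads are stale

    def read(self, block_no, concurrent_write=None):
        if block_no in self.cache:
            return self.cache[block_no]
        self.read_in_progress.add(block_no)      # set the flag, then begin the disk read
        data = self.disk[block_no]
        if concurrent_write is not None:         # simulate a write landing mid-read
            self.uncached_write(block_no, concurrent_write)
        if block_no in self.read_in_progress:    # flag still set: block is not stale
            self.read_in_progress.discard(block_no)
            self.cache[block_no] = data
        return data                              # a stale block is returned but not cached
```

In this model, a block that was overwritten while its read was in progress is never placed in the cache, so the next read of that block goes back to disk storage.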

The inventions of appellants' dependent claims 21, 28 and 54 further include the network
file server maintaining a generation count (step 512 in FIG. 10) for each read of a file block from
the file system in the disk storage in response to a read request for a file block that is not in the
file system cache (step 92 in FIG. 10), and checking whether a file block having been read from
the file system in the disk storage has become stale by checking whether the generation count for
the file block having been read from the file system is the same as the generation count for the
last read request for the same file block (step 513 in FIG. 10). 

The invention of appellants' dependent claim 59 is similar to claim 61 to the extent that 
claim 59 further defines an uncached write interface (63 in FIG. 3), a file system cache (51 in 
FIG. 2 and FIG. 3) and a cached read-write interface (61 in FIG. 3; appellants' specification, 
page 7, lines 12-13; page 13 line 8 to page 14 line 17) wherein the uncached write interface 
bypasses the file system cache for sector-aligned write operations. (FIG. 3; step 85 in FIG. 5; 
page 13 lines 17-23; page 17 lines 12-15.) 

The invention of appellants' dependent claim 60 is similar to claim 61 to the extent that 
claim 60 defines that the network file server is further programmed to invalidate cache blocks in 
the file system cache including sectors being written to by the uncached write interface. (FIG. 6,
steps 88 and 89; appellants' specification, page 18 lines 8-14.) 
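
For illustration only, routing a sector-aligned write around the file system cache and invalidating the affected cache blocks might be sketched in Python as follows. The sector size, the block size, and all function names are assumptions made for the sketch and are not taken from appellants' specification.

```python
SECTOR_SIZE = 512    # assumed sector size in bytes
BLOCK_SIZE = 8192    # assumed file system block size in bytes


def is_sector_aligned(offset, length):
    """True if the write starts and ends on sector boundaries."""
    return length > 0 and offset % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0


def blocks_to_invalidate(offset, length):
    """Cache blocks that include any sector touched by the write."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return list(range(first, last + 1))


def route_write(cache, offset, length):
    """Choose the uncached path for sector-aligned writes and invalidate the cache."""
    if is_sector_aligned(offset, length):
        for block_no in blocks_to_invalidate(offset, length):
            cache.pop(block_no, None)
        return "uncached"
    return "cached"
```

Under these assumed sizes, a 1024-byte write at offset 7680 is sector-aligned and spans the boundary between blocks 0 and 1, so both cache blocks are invalidated.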

VI. GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL

1. Whether claims 1-4, 10, 11, 15, 22, 32, 33-36, 44, 45, 49, 58, and 61-65 and 67- 
71 and 73 are unpatentable under 35 U.S.C. 102(e) as being anticipated by Burns et al. U.S. 
Patent 6,925,515 B2. 

2. Whether claims 5-9, 12-14, 16-21, 23-28, 37-43, 46-48, and 50-54 are 
unpatentable under 35 U.S.C. 103(a) over Burns et al. U.S. Patent 6,925,515 B2 in view of
Marcotte U.S. Patent 6,449,614 B1.

3. Whether claims 59 and 60 are unpatentable under 35 U.S.C. 103(a) over Burns et
al. U.S. Patent 6,925,515 B2 in view of Xu et al. US 6,324,581 B1.

VII. ARGUMENT

1. Claims 1-4, 10, 11, 15, 22, 32, 33-36, 44, 45, 49, 58, 61-65, 67-71, and 73 are
patentable under 35 U.S.C. 102(e) and are not anticipated by Burns et al. U.S. Patent
6,925,515 B2.

"For a prior art reference to anticipate in terms of 35 U.S.C. § 102, every element of the 
claimed invention must be identically shown in a single reference." Diversitech Corp. v. Century 
Steps, Inc., 7 U.S.P.Q.2d 1315, 1317 (Fed. Cir. 1988), quoted in In re Bond, 910 F.2d 831, 15
U.S.P.Q.2d 1566, 1567 (Fed. Cir. 1990) (vacating and remanding Board holding of anticipation; the 
elements must be arranged in the reference as in the claim under review, although this is not an ipsis 
verbis test). 

Claim Interpretation 

With respect to appellants' claim 1, the word "then" appears at the end of each of steps 
(a), (b), (c), (d), (e), and (f). The word "then" cannot be rendered meaningless. Thus, the 
appellants construe the steps of claim 1 to be performed in a sequence in the order recited. In 
addition, step (a) is different from step (e), and step (c) is different from step (g). 

"While not an absolute rule, all claim terms are presumed to have meaning in a claim." 
Innova/Pure Water v. Safari Water Filtration Sys., Inc., 381 F.3d 1111, 1119 (Fed. Cir.
2004) (defendant's claim construction impermissibly read the term "operatively" out of the
phrase "operatively connected"). Moreover, if the specification does not reveal any special
definition for the phrase "; and then", then the phrase must be construed according to the ordinary
meaning that the term would have to a person of ordinary skill in the art in question at the time of
the invention. Phillips v. AWH Corp., 415 F.3d 1303, 1312-13 (Fed. Cir. 2005) (en banc).
Dictionaries are among the many tools that can assist the court in determining the meaning of
particular terminology to those of skill in the art of the invention. Id. at 1318.

Appellants respectfully submit that the ordinary meaning of the phrase "X, and then Y" is 
that Y is "following next after [X] in order". This ordinary meaning also avoids an issue of 
"double inclusion" that arises if step (a) were construed to be the same step as step (e), and step 
(c) were construed to be the same step as step (g). Appellants respectfully submit that it is
unreasonable to construe claim 1 so as to raise an issue of "double inclusion." See MPEP
2173.05(o) Double Inclusion. Appellants respectfully submit that similar language in their other
claims should be construed in the same fashion. 

Claims 1, 10, 33, 44, 62, 68

Appellants' claim 1 calls for a network file server responding to a concurrent write 
request from a client for access to a file by a sequence of seven specific steps performed in a 
specific order. The network file server responds by: 

(a) obtaining a lock for the file; and then

(b) preallocating a metadata block for the file; and then 

(c) releasing the lock for the file; and then

(d) asynchronously writing to the file; and then

(e) obtaining the lock for the file; and then 

(f) committing the metadata block to the file in the data storage; and then 

(g) releasing the lock for the file. 

The appellants' drawings show these steps (a) to (g) in FIG. 11 in boxes 101, 102, 103, 104-105, 
106, 107, and 108, respectively, as described in appellants' specification on page 27 lines 3-23. 
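
By way of illustration only, the ordering of steps (a) to (g) can be sketched in Python as follows. This is a minimal sketch, not the implementation described in the specification: the names FileSketch, preallocate_metadata_block, write_data, and handle_concurrent_write are hypothetical, and a thread pool merely stands in for asynchronous writing. The point shown is that the file lock is held only around preallocation and around the commit, and is not held while the data is written.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FileSketch:
    """Hypothetical stand-in for a file guarded by a per-file lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_block = 0
        self.data = {}        # block number -> bytes written
        self.committed = []   # metadata blocks committed to the file
        self.trace = []       # order in which the steps completed

    def preallocate_metadata_block(self):
        block = self.next_block
        self.next_block += 1
        return block

    def write_data(self, block, payload):
        self.data[block] = payload


def handle_concurrent_write(f, payload, pool):
    """Run steps (a) to (g) in the recited order for one write request."""
    f.lock.acquire()                              # (a) obtain the lock
    f.trace.append("a")
    block = f.preallocate_metadata_block()        # (b) preallocate
    f.trace.append("b")
    f.lock.release()                              # (c) release the lock
    f.trace.append("c")
    # (d) the write runs on a worker thread while the file lock is free
    pool.submit(f.write_data, block, payload).result()
    f.trace.append("d")
    f.lock.acquire()                              # (e) obtain the lock again
    f.trace.append("e")
    f.committed.append(block)                     # (f) commit the block
    f.trace.append("f")
    f.lock.release()                              # (g) release the lock
    f.trace.append("g")
    return block
```

In this sketch, a second request could obtain the lock and preallocate its own block while the first request is still at step (d), which is the contention-reducing effect that the recited order is directed to.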

In appellants' view, Burns does not disclose a network file server computer responding to 
a concurrent write request from a client by performing the appellants' claimed sequence of seven
steps. Instead, Burns discloses a network file server computer that responds to concurrent write
requests by performing a different sequence of steps depending on the "locking mode"
associated with the request. The citation on page 3 of the final Official Action to Burns col. 10
lines 14-30, however, indicates that the Official Action has identified the "locking mode" and the 
kind of write operation in Burns that is most pertinent to the subject matter of appellants' claim 
1. 

Burns (FIG. 3) discloses a Client/Server distributed file system on a storage area network
(SAN). (Col. 5, lines 64-65.) The metadata is stored separately from the data. The metadata is 
stored in a file server cluster. The data is stored on shared disks of the SAN. (Col. 7, lines 1-9.) 
In a preferred embodiment, the clients of the distributed file system include a DBMS server, and 
a plurality of web servers. The DBMS server may update a web page, and the web servers may 
read the web page as the update occurs. The DBMS server obtains an exclusive "producer lock"
on the web page. The producer lock enables the holder to write data, and allocate and cache data
for an "out of place" write that writes the data to a different physical storage location than from 
which it was read. By performing an "out-of-place" write, the old data still exists and is 
available to clients. Once the writer completes the write and releases the producer lock, the 
previous data is invalidated and the clients are informed of the new location of the data. Clients 
can then read the new data from storage when needed, and the server reclaims the old data 
blocks. (Burns, col. 5 lines 31-54; col. 10 lines 13-29.)

Page 3 of the final Official Action cites Burns col. 10 lines 14-30 for appellants' step (b) 
of preallocation, and also for appellants' step (c) of releasing the lock for the file. Page 3 of the 
final Official Action cites Burns col. 10 lines 30-38 for appellants' step (d) of asynchronously
writing to the file. However, Burns col. 10 lines 14-30 do not disclose that the lock for the file
(obtained in the first step (a)) is released after step (b) and before step (d), nor do Burns col. 10
lines 14-30 disclose that the lock for the file is obtained again in step (e) and released again in
step (g), as recited in appellants' claim 1. 

With respect to appellants' step (b) of "preallocating a metadata block for the file," page 
3 lines 1-3 of the final Official Action says: "Note that the file is allocated before the write takes 
place making this a pre-allocation." However, the fact that the file is allocated before the write 
takes place is not a sufficient disclosure of preallocation of a metadata block in response to the 
client write request and after obtaining a lock for the file, as defined in claim 1 and many of 
appellants' other independent claims. For example, for the "in-place" write of Burns, the file and
the storage of the file to receive the new data for the write operation are allocated before the client
request for the "in-place" write and before a shared lock is placed on the file in response to the
client request for the "in-place" write. 

With respect to appellants' step (c) of "releasing the lock for the file," page 3 lines 3-4 of 
the final Official Action says: "Note that on[c]e the file has been allocated it is unlocked for 
write." However, Burns col. 10, lines 14-30, does not disclose that the file server responds to a 
client write request by locking the file for allocation and then unlocking it for the write, as 
defined in appellants' claim 1. Instead, Burns col. 10 lines 14-30 disclose that "for database and
parallel applications, the write privilege is granted on a shared lock and must be differentiated 
from allocation. Write, in this case, means an in-place write, where data are written back to the 
same physical location from which they were, or could have been, read. This is to be 
differentiated from out-of-place write, where portions of a file are written to physical storage
locations that differ from the locations from which they were read."

More importantly, Burns col. 10, lines 14-30, do not disclose the network file server 
responding to a concurrent write request from a client by obtaining a lock for the file, and then 
preallocating a metadata block for the file, and then releasing the lock for the file, and then 
asynchronously writing to the file. Instead, Burns col. 10, lines 14-30, discloses and explicitly
differentiates two different kinds of writes. The first kind of write is the "in-place" write, where 
the write privilege is granted on a shared lock, for example for database and parallel applications. 
The second kind of write is the out-of-place write, which should properly be considered an 
allocation followed by a write. As further disclosed in Burns col. 10 lines 35-37, allocation
requires an exclusive lock.

Although Burns does not disclose the details of the two different kinds of writes, the
"out-of-place" write of Burns could be performed by the following sequence in response to a
first client write request: 

a) obtaining the exclusive lock for the file; and then 

b) obtaining allocations of data blocks for the new data; and then 

c) writing new data to the allocated data blocks; and then 

d) committing the allocated data blocks to the file in the data storage; and then 

e) releasing the exclusive lock for the file. 

The "in-place" write of Burns could be performed by the following sequence in response
to a second client write request: 

a) obtaining the shared lock on the file; and then 

b) asynchronously writing data to the file; and then 

c) releasing the shared lock on the file. 
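
The difference between the two hypothetical sequences above and the sequence of claim 1 can be checked mechanically. The following Python sketch is offered only as an illustration: the step labels and the helper names lock_held_at and lock_acquisitions are hypothetical, "allocate" stands loosely for allocating data blocks or preallocating a metadata block, and the sketch models nothing beyond the order of the steps.

```python
# Each sequence is an ordered list of illustrative step labels.
OUT_OF_PLACE = ["lock", "allocate", "write", "commit", "unlock"]
IN_PLACE = ["lock", "write", "unlock"]
CLAIM_1 = ["lock", "allocate", "unlock", "write", "lock", "commit", "unlock"]


def lock_held_at(steps, target):
    """Return True if the lock is held when the first `target` step runs."""
    held = False
    for step in steps:
        if step == target:
            return held
        if step == "lock":
            held = True
        elif step == "unlock":
            held = False
    raise ValueError(f"{target!r} does not occur in the sequence")


def lock_acquisitions(steps):
    """Count how many times the sequence obtains the lock."""
    return steps.count("lock")
```

Under these assumptions, both hypothetical sequences hold the lock while the data is written and obtain it once, whereas the sequence of claim 1 writes with the lock released and obtains the lock twice.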

These hypothetical sequences for the "out-of-place" write and the "in-place" write of 
Burns are entirely consistent with the "Notes" in the final Official Action, yet the sequence of
steps (a) to (g) of appellants' claim 1 is neither disclosed nor obvious from these hypothetical
sequences. Thus, the appellants maintain that there is no disclosure or suggestion in Burns of
responding to a client write request by releasing the lock for the file after allocation of the data
blocks (or preallocating a metadata block for the file) and before writing the new data to the file, 
nor is there any disclosure or suggestion of again obtaining the lock on the file after writing the 
new data to the file and before committing the allocated data blocks (or the metadata block) to 
the file in the data storage. In addition, Burns does not disclose details of how a block of data is
allocated to the file, or committed to the file. For example, there is nothing in Burns disclosing 
that a metadata block is preallocated for the file, and then the file is asynchronously written to, 
and then the metadata block is committed to the file in data storage, as recited in appellants' 
claim 1. 

Burns teaches away from releasing the lock for the file in step (c) and then
asynchronously writing to the file in step (d) and then again obtaining the lock for the file in step
(e), because Burns, as set out above, teaches that the file would be exclusively write-locked by
the producer lock when a write operation adding a block to the file is performed on the file. See,
for example, Burns col. 11, lines 47-49: "With C and P locks, the web servers are not updated
until the P lock holder releases the lock, generally on closing the file." See also Burns, Abstract:
"This system is implemented using two whole file locks: a producer lock P and a consumer lock
C." (Emphasis added.) 

Claims 2 and 34 

With respect to appellants' dependent claims 2 and 34, page 3 of the Official Action cites 
Burns column 10 line 14 through column 11 line 14. Although Burns discloses that a P lock
holder can update location metadata at the server (col. 10 lines 49-53), Burns does not disclose
details of this metadata structure such as an indirect block of metadata, as recited in appellants' 
claim 2. 

Claims 3 and 35 

With respect to appellants' dependent claims 3 and 35, page 3 of the Official Action cites
Burns column 10 line 14 through column 11 line 14. However, Burns does not disclose details of this
metadata structure such as an indirect block of metadata, or the use of this metadata structure 
such as copying data from an original indirect block of the file to the metadata block for the file, 
the original indirect block of the file having been shared between the file and a read-only version 
of the file, as recited in appellants' claim 3. 

Claims 4 and 36

With respect to appellants' dependent claims 4 and 36, page 3 of the Official Action cites 
Burns, column 10 line 14 through column 11 line 14. However, Burns does not disclose
concurrent writing for more than one client to a metadata block for the file. Instead, as discussed
above, Burns discloses that a write operation that would change the allocation of a block requires
an exclusive lock.

Claims 11 and 45

With respect to appellants' dependent claims 11 and 45, page 4 of the Official Action
cites Burns column 10 line 14 through column 11 line 14. Burns, however, does not disclose
gathering together preallocated metadata blocks for a plurality of client write requests to the file, 
and committing together the preallocated metadata blocks for the plurality of client write 
requests to the file by obtaining the lock for the file, committing the gathered preallocated 
metadata blocks for the plurality of client write requests to the file, and then releasing the lock 
for the file. Instead, if a client obtains a producer lock for a change in allocation of the file, this 
producer lock is exclusive and prevents other clients from obtaining a producer lock on the file. 
(See col. 10, lines 50-65.) The producer lock would be obtained and held before and while any 
new data blocks, and their metadata, would be preallocated and later committed. Therefore,
Burns teaches away from gathering together preallocated metadata blocks for a plurality of client
write requests to the file, and committing together the preallocated metadata blocks for the 
plurality of client write requests to the file by obtaining the lock for the file, committing the 
gathered preallocated metadata blocks for the plurality of client write requests to the file, and 
then releasing the lock for the file. 
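
For illustration only, the gathering and committing together recited in claims 11 and 45 might be sketched in Python as follows. The class GatherCommitFile, its methods, and the separate pending_guard lock are hypothetical and are not taken from the specification or from Burns. The sketch shows only that preallocated blocks from several write requests can be collected and then committed under a single acquisition of the file lock.

```python
import threading


class GatherCommitFile:
    """Hypothetical file that batches commits of preallocated blocks."""

    def __init__(self):
        self.lock = threading.Lock()           # the lock for the file
        self.pending_guard = threading.Lock()  # assumed guard for the pending list
        self.pending = []                      # preallocated blocks awaiting commit
        self.committed = []                    # blocks committed to the file
        self.lock_acquisitions = 0             # times the file lock was obtained

    def finish_write(self, block):
        """Record that the asynchronous write for `block` has completed."""
        with self.pending_guard:
            self.pending.append(block)

    def commit_gathered(self):
        """Commit every gathered block under one file-lock acquisition."""
        with self.pending_guard:
            batch, self.pending = self.pending, []
        if not batch:
            return 0
        with self.lock:                        # obtain the lock for the file
            self.lock_acquisitions += 1
            self.committed.extend(batch)       # commit the gathered blocks
        return len(batch)                      # lock released on leaving the block
```

In this sketch three completed writes followed by one call to commit_gathered result in one acquisition of the file lock rather than three.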

Claims 15, 22, 49, 64, and 70

With respect to appellants' independent claims 15 and 49, as discussed above with 
reference to appellants' claim 11, Burns fails to disclose gathering together preallocated 
metadata blocks for a plurality of client write requests to the file, and committing together the
preallocated metadata blocks for the plurality of client write requests to the file by obtaining the
lock for the file, committing the gathered preallocated metadata blocks for the plurality of client 
write requests to the file, and then releasing the lock for the file. 

Claims 32, 58, 67, and 73

With respect to appellants' independent claims 32 and 58, see appellants' remarks above 
with reference to appellants' claim 1. In short, Burns does not disclose details of how a new
metadata block is added to the file. There is no disclosure or suggestion in Burns of responding
to a client write request by releasing the allocation mutex for the file after allocation of the data
blocks (or preallocating a metadata block for the file) and before writing the new data to the file,
nor is there any disclosure or suggestion of again obtaining the allocation mutex for the file after
writing the new data to the file and before committing the allocated data blocks (or the metadata
block) to the file in the data storage. Therefore, appellants respectfully submit that Burns does
not disclose that new metadata blocks are added to the file by the appellants' recited sequence of
the steps (a), (b), (c), (d) & (e), (f), (g), and (h). 

Claim 61 

With respect to appellants' independent claim 61, Burns discloses that a change in 
allocation requires an exclusive lock. Burns, col. 10, lines 35-37. In Burns, there can be
multiple concurrent writers that write directly to the data storage of the SAN, thereby bypassing
the server cluster. In this case, however, Database and Parallel Application Locking (instead of
File System Locking) is used to ensure data consistency. See the table at the top of column 10 in
Burns. Thus, Burns does not disclose that the network file server includes both an uncached
write interface and a cached write interface in which the network file server is further
programmed to invalidate cache blocks in a file system cache including sectors being written to
by an uncached write interface. Instead, Burns teaches a different way of ensuring consistency
between client caches and the result of an "out-of-place" write that changes the allocation of the
file. The "out-of-place" writer obtains an exclusive "producer lock" on the file, and then writes
to the file. "The writer must release the P lock on close to publish the file. The server sends
location updates to all clients that hold consumer locks 410. The clients immediately invalidate
the affected blocks in their cache." (Burns, column 11, lines 34-38.) Thus, Burns teaches the
network file server invalidating client caches when file system locking is used, rather than the
network file server invalidating a file system cache of a cached read-write interface when an
uncached write interface bypasses the file system cache by performing a sector-aligned write
operation.

Claims 63 and 65 

Appellants' dependent claims 63 and 65 are dependent upon claim 13, and therefore 
incorporate by reference the limitations of claim 13. Claim 13 is distinguished from Burns 
because, as recognized on page 8 of the final Official Action of January 13, 2010, Burns does not
disclose that the asynchronous writing to the file includes a partial write to a new block that has
been copied at least in part from an original block of the file, and wherein the method includes
checking a partial block conflict queue for a conflict with a concurrent write to the new block,
and upon finding an indication of a conflict with a concurrent write to the new block, waiting 
until resolution of the conflict with the concurrent write to the new block, and then performing 
the partial write to the new block. 

Claims 65 and 71 

Appellants' dependent claims 65 and 71 are dependent upon claim 25, and therefore 
incorporate by reference the limitations of claim 25. Claim 25 is distinguished from Burns 
because, as recognized on page 18 of the final Official Action of January 13, 2010, Burns does
not disclose the network file server computer responding to concurrent write requests by writing 
new data for specified blocks of the file to the disk storage without writing the new data for the 
specified blocks of the file to the file system cache, and invalidating the specified blocks of the 
file in the file system cache, and Burns also does not disclose the network file server computer 
responding to read requests for file blocks not found in the file system cache by reading the file 
blocks from the file system in disk storage and then checking whether the file blocks have
become stale due to concurrent writes to the file blocks, and writing to the file system cache a 
file block that has not become stale, and not writing to the file system cache a file block that has 
become stale. 
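
The cache behavior recited in claim 25 can be illustrated by the following Python sketch. It is a sketch under stated assumptions only: the class CacheSketch and its methods are hypothetical, and the per-block generation counter used to detect staleness is an assumption made for the example, not a mechanism taken from the specification. The optional concurrent_write argument is a test hook that simulates a write racing with the read.

```python
class CacheSketch:
    """Hypothetical disk storage with a file system cache in front of it."""

    def __init__(self):
        self.disk = {}         # block number -> data in disk storage
        self.cache = {}        # block number -> data in the file system cache
        self.generation = {}   # assumed per-block counter of uncached writes

    def uncached_write(self, block, data):
        """Write to disk without writing the cache, then invalidate the block."""
        self.disk[block] = data
        self.generation[block] = self.generation.get(block, 0) + 1
        self.cache.pop(block, None)

    def read(self, block, concurrent_write=None):
        """Read a block, caching it only if it did not become stale meanwhile."""
        if block in self.cache:
            return self.cache[block]
        seen = self.generation.get(block, 0)
        data = self.disk[block]                  # read from disk storage
        if concurrent_write is not None:
            concurrent_write()                   # simulated racing write
        if self.generation.get(block, 0) == seen:
            self.cache[block] = data             # not stale: write to the cache
        return data                              # stale data is never cached
```

A read that races with an uncached write still returns the data it read, but leaves the cache empty for that block so that the next read fetches the newer data from disk.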

2. Claims 5-9, 12-14, 16-21, 23-28, 37-43, 46-48, and 50-54 are patentable under
35 U.S.C. 103(a) over Burns et al. U.S. Patent 6,925,515 B2 in view of Marcotte U.S. Patent
6,449,614 B1.

The policy of the Patent and Trademark Office has been to follow in each and every case the 
standard of patentability enunciated by the Supreme Court in Graham v. John Deere Co., 148
U.S.P.Q. 459 (1966). M.P.E.P. § 2141. As stated by the Supreme Court:



Under § 103, the scope and content of the prior art are to be determined; 
differences between the prior art and the claims at issue are to be ascertained; and 
the level of ordinary skill in the pertinent art resolved. Against this background, 
the obviousness or nonobviousness of the subject matter is determined. Such 
secondary considerations as commercial success, long felt but unsolved needs,
failure of others, etc., might be utilized to give light to the circumstances 
surrounding the origin of the subject matter sought to be patented. As indicia of 
obviousness or nonobviousness, these inquiries may have relevancy. 

148 U.S.P.Q. at 467.

The problem that the inventor is trying to solve must be considered in determining whether 
or not the invention would have been obvious. The invention as a whole embraces the structure, 
properties and problems it solves. In re Wright, 848 F.2d 1216, 1219, 6 U.S.P.Q.2d 1959, 1961
(Fed. Cir. 1988).

Burns discloses a distributed file system supporting three kinds of locking systems. A
first locking system is designed for sequential consistency with write-back caching, typical of 
distributed file systems. A second locking system is provided for sequential consistency with no 
caching for applications that manage their own caches. A third locking system implements a 
weaker consistency model with write-back caching, designed for efficient replication and 
distribution of data. Locks for replication are suitable for serving dynamic data on the Internet 
and other highly-concurrent applications. The selection of the appropriate lock protocol for each 
file is set using the file metadata. Further, a novel locking system is provided for the lock system 
implementing a weak consistency model with write back caching. This system is implemented 
utilizing two whole file locks: a producer lock P and a consumer lock C. Any client can hold a 
consumer lock and when holding a consumer lock can read data and cache data for read. The 
producer lock is only held by a single writer and a writer holding a producer lock can write data, 
allocate and cache data for writing. When a writer performs a write, the write is performed as an 
out-of-place write. An out-of-place write writes the data to a different physical storage location 
than from which it was read. By performing an out-of-place write the old data still exists and is
available to clients. Once the writer completes the write and releases the producer lock the 
previous data is invalidated and the clients are informed of the new location of the data. Clients 
can then read the new data from storage when needed and the server reclaims the old data blocks. 
(See Burns, Abstract.)

Marcotte discloses an interface system and methods for asynchronously updating a shared
resource with locking facility. Tasks make updates requested by calling tasks to a shared
resource serially in a first come first served manner, atomically, but not necessarily 
synchronously, such that a current task holding an exclusive lock on the shared resource makes 
the updates on behalf of one or more calling tasks queued on the lock. Updates waiting in a 
queue on the lock to the shared resource may be made while the lock is held, and others deferred 
for post processing after the lock is released. Some update requests may also, at the calling 
application's option, be executed synchronously. Provision is made for nested asynchronous 
locking. Data structures (wait elements) describing update requests may be queued in a wait queue
for update requests awaiting execution by a current task, other than the calling task, currently 
holding an exclusive lock on the shared resource. Other queues are provided for queuing data 
structures removed from the wait queue but not yet processed; data structures for requests to 
unlock or downgrade a lock; data structures for requests which have been processed and need to 
be returned to free storage; and data structures for requests that need to be awakened or that 
describe post processing routines that are to be run while the lock is not held. (Marcotte, 
Abstract.) 

As discussed above, various elements of the respective base claims are missing from 
Burns. Marcotte does not disclose these elements missing from Burns, so that the appellants'
claimed invention does not result from the proposed combination of Burns and Marcotte. 




Moreover, the appellants' claims define a substantial improvement over Burns and Marcotte by 
providing a new way of handling a write request that changes the allocation of the data blocks to 
a file, and this new way of handling such a write request solves a long-felt need for reducing 
contention during multi-threaded or multi-processor access to a shared file system. 

Appellants respectfully submit that Burns does not address the problem of providing 
concurrent writes for the case of a write operation that changes the allocation of a block of data. 
Instead, Burns col. 10 lines 35-37 says: "The only guarantee multiple concurrent writers need is 
that the data does not change location, hence allocation requires an exclusive lock." Thus, 
appellants rely on their previously submitted evidence of a long-felt but unsolved problem that in 
a shared or parallel file system, the potential speed at which an application may execute is 
impaired by the need for file locking. 

Chang et al. US 2005/0039049, cited by the Examiner, is evidence of a long-felt but 
unsolved problem that in a shared or parallel file system, the potential speed at which an 
application may execute is impaired by the need for file locking. Chang teaches that for "an 
application having its own serialization or locking mechanisms" (Chang, paragraph [0014]), 
"multiple processes may write to the same block of data within the file at approximately the 
same time as long as they are not changing the allocation of the block of data, i.e. either 
allocating the block, deallocating the block of data, or changing the block of data." (Chang, 
paragraph [0015].) The appellants' invention further solves this problem by enabling multiple
processes to write to the file at approximately the same time when updating the metadata
structure associated with the file. The appellants cited additional evidence (on an IDS form filed 
on April 15, 2009) showing that shared or parallel file systems and their associated problems 
have been known for at least a decade prior to the filing of the appellants' patent application. 
See Row et al. U.S. Patent 5,163,131 filed Sept. 8, 1989 entitled Parallel I/O Network File Server 
Architecture, and Hitz et al. U.S. Patent 6,065,037, also having a priority date of Sep. 8, 1989. 
Copies of Chang, Row, and Hitz are included in appellants' Evidence Appendix IX. 

With respect to appellants' dependent claims 5-9, 12, 16-21, 23-24, 37-43, and 50-54, 
these claims depend from the independent claims 1, 15, 33, and 49. Burns has been
distinguished above with respect to the independent claims 1, 15, 33, and 49, and Marcotte does
not provide the limitations of these independent claims that are missing from Burns. Therefore
the dependent claims 5-9, 12, 16-21, 23-24, 37-43, and 50-54 are patentable over the proposed 
combination of Burns and Marcotte. It would not have been obvious for one of ordinary skill to 
combine Burns and Marcotte in the fashion as proposed in the Official Action and then modify 
that combination by adding the missing limitations. 

Claims 5 and 37 

Appellants' dependent claim 5 adds to claim 1 the limitations of "wherein the 
asynchronous writing to the file includes a partial write to a new block that has been copied at 
least in part from an original block of the file, and wherein the method further includes checking
a partial block conflict queue for a conflict with a concurrent write to the new block, and upon
failing to find an indication of a conflict with a concurrent write to the new block, preallocating
the new block, copying at least a portion of the original block of the file to the new block, and
performing the partial write to the new block." Appellants' partial block conflict queue 73 is
shown in appellants' FIG. 4 and described in appellants' specification on page 15 lines 17-22 as
follows: 

The preallocation method allows concurrent writes to indirect blocks 
within the same file. Multiple writers can write to the same indirect block tree 
concurrently without improper replication of the indirect blocks. Two different 
indirect blocks will not be allocated for replicating the same indirect block. The 
write threads use the partial block conflict queue 73 and the partial write wait 
queue 74 to avoid conflict during partial block write operations, as further
described below with reference to FIG. 13. 

See also appellants' FIG. 13 and specification page 28 line 22 to page 29 line 7; and FIG. 17 and 
page 34 line 7 to page 35 line 5. 
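
As a rough illustration of the limitations quoted above, the following Python sketch models a partial write that first checks a conflict queue. The sketch rests on assumptions: the class PartialWriteSketch, the dictionary standing in for the partial block conflict queue 73, the list standing in for the partial write wait queue 74, and the resolve method are all hypothetical, and the handling of a found conflict (deferring the write until the conflict is resolved) follows the language of claim 13 rather than any detail of FIG. 13.

```python
class PartialWriteSketch:
    """Hypothetical model of partial writes to a new copy of an original block."""

    def __init__(self, original):
        self.original = original
        self.conflict_queue = {}   # block number -> new block being written
        self.wait_queue = []       # deferred (block number, offset, data) writes

    def partial_write(self, block_no, offset, data):
        new_block = self.conflict_queue.get(block_no)   # check for a conflict
        if new_block is None:
            # No conflict found: preallocate the new block, copy the original
            # block into it, and perform the partial write.
            new_block = bytearray(self.original)
            self.conflict_queue[block_no] = new_block
            new_block[offset:offset + len(data)] = data
            return "written"
        # Conflict found: wait until the conflict is resolved.
        self.wait_queue.append((block_no, offset, data))
        return "waiting"

    def resolve(self, block_no):
        """Resolve the conflict and perform the deferred partial writes."""
        new_block = self.conflict_queue.pop(block_no)
        still_waiting = []
        for entry in self.wait_queue:
            waiting_block, offset, data = entry
            if waiting_block == block_no:
                new_block[offset:offset + len(data)] = data
            else:
                still_waiting.append(entry)
        self.wait_queue = still_waiting
        return bytes(new_block)
```

Because the second writer finds the first writer's entry in the conflict queue, only one new block is allocated as the copy of the original block, which is consistent with the statement in the quoted passage that two different blocks will not be allocated for replicating the same block.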

With respect to appellants' dependent claim 5, pages 8-9 of the final Official Action 
recognizes that Burns fails to explicitly recite the limitations added by dependent claim 5, and
says that Marcotte teaches these limitations, citing column 13, lines 35 through col. 14, line 43. 
However, Marcotte column 13, lines 35 through col. 14, line 43 deals with managing an I/O 
device holding queue above a device for queuing pending I/O's if the number of I/O's issued to 
the device exceeds a threshold that is adjustable by an application. It is not seen where Marcotte 
discloses a partial write to a new block that has been copied at least in part from an original
block of the file. Nor is it seen where Marcotte discloses checking a partial block conflict queue
for a conflict with a concurrent write to the new block, and upon failing to find an indication of a 
conflict with a concurrent write to the new block, preallocating the new block, copying at least a 
portion of the original block of the file to the new block, and performing the partial write to the 
new block. 

"[R]ejections on obviousness grounds cannot be sustained by mere conclusory 
statements; instead, there must be some articulated reasoning with some rational underpinning to 
support the legal conclusion of obviousness." In re Kahn, 441 F.3d 977, 988 (Fed. Cir. 2006).
A fact finder should be aware of the distortion caused by hindsight bias and must be cautious of
arguments reliant upon ex post reasoning. See KSR International Co. v. Teleflex Inc., 550 U.S. ___,
82 USPQ2d 1385 (2007), citing Graham, 383 U.S. at 36 (warning against a "temptation to read
into the prior art the teachings of the invention in issue" and instructing courts to "guard against 
slipping into the use of hindsight."). 

Claims 6 and 38 

Appellants' claim 6 is similar to claim 5 in that it also adds to claim 1 the limitations of 
wherein the asynchronous writing to the file includes a partial write to a new block that has been 
copied at least in part from an original block of the file, and wherein the method further includes 
checking a partial block conflict queue for a conflict with a concurrent write to the new block.
With respect to dependent claim 6, pages 8-9 of the final Official Action also recognizes that
Burns fails to explicitly recite the limitations added by dependent claim 6, and says that Marcotte
teaches these limitations, again citing Marcotte column 13, lines 35 through col. 14, line 43.
However, Marcotte column 13, lines 35 through col. 14, line 43 deals with managing an I/O
device holding queue above a device for queuing pending I/O's if the number of I/O's issued to
the device exceeds a threshold that is adjustable by an application. It is not seen where Marcotte 
discloses a partial write to a new block that has been copied at least in part from an original 
block of the file. Nor is it seen where Marcotte discloses checking a partial block conflict queue 
for a conflict with a concurrent write to the new block. 

Claims 12 and 46 

With respect to appellants' dependent claim 12, Burns does not further teach checking
whether a previous commit is in progress after asynchronously writing to the file and before
obtaining the lock for the file for committing the metadata block to the file. As discussed above
with reference to appellants' claim 1, in response to a concurrent write request from a client,
Burns does not again obtain the lock for the file after asynchronously writing to the file. Nor
does Marcotte column 12, line 7 through column 13, line 32 disclose these limitations missing 
from Burns or specifically deal with committing "metadata blocks" to a file. 

Claims 13, 14, 47, and 48 

Appellants' independent claim 13 recites "wherein the asynchronous writing to the file 
includes a partial write to a new block that has been copied at least in part from an original block 
of the file, and wherein the method includes checking a partial block conflict queue for a conflict
with a concurrent write to the new block, and upon finding an indication of a conflict with a
concurrent write to the new block, waiting until resolution of the conflict with the concurrent 
write to the new block, and then performing the partial write to the new block." Thus, 
Appellants' independent claim 13 is patentable over Burns in combination with Marcotte for the 
reasons given above with reference to appellants' claim 5. Marcotte does not disclose a partial
write to a new block that has been copied at least in part from an original block of the file and 
checking a partial block conflict queue for a conflict with a concurrent write to the new block, 
and upon failing to find an indication of a conflict with a concurrent write to the new block, 
preallocating the new block, copying at least a portion of the original block of the file to the new 
block, and performing the partial write to the new block. 

Claims 16 and 50 

Appellants' dependent claim 16 adds to claim 15 the express limitations of claim 12 and 
therefore is patentable over Burns in combination with Marcotte for the reasons given above with 
reference to appellants' claims 12 and 15. Burns does not further teach checking whether a
previous commit is in progress after asynchronously writing to the file and before obtaining the 
lock for the file for committing the metadata block to the file. 

Claim 17 

Appellants' dependent claim 17 adds to claim 15 the limitations of "wherein the network 
file server includes disk storage containing a file system, and a file system cache storing data of
blocks of the file, and the method further includes the network file server computer responding to
concurrent write requests by writing new data for specified blocks of the file to the disk storage 
without writing the new data for the specified blocks of the file to the file system cache, and 
invalidating the specified blocks of the file in the file system cache." Pages 13-14 of the Official 
Action recognize that Burns fails to explicitly recite the recited operations of "the network file 
server computer responding to concurrent write requests by writing new data for specified blocks 
of the file to the disk storage without writing the new data for the specified blocks of the file to 
the file system cache, and invalidating the specified blocks of the file in the file system cache." 
Page 14 of the Official Action cites Marcotte column 12 line 7 through column 13, line 32. 
However, the appellants' specific claim limitations are not disclosed in Marcotte column 12 line 
7 through column 13, line 32. See the discussion of Marcotte below with reference to claim 25. 

Claim 18 

Appellants' dependent claim 18 adds to claim 17 the limitations of "the network file 
server computer responding to read requests for file blocks not found in the file system cache by 
reading the file blocks from the file system in disk storage and then checking whether the file
blocks have become stale due to concurrent writes to the file blocks, and writing to the file 
system cache a file block that has not become stale, and not writing to the file system cache a file 
block that has become stale." Page 14 of the Official Action again cites Marcotte column 12 line 
7 through column 13, line 32. However, the specific limitations of appellants' claim 18 are not
disclosed in Marcotte column 12 line 7 through column 13, line 32. See the discussion of
Marcotte below with reference to claim 25. 

Claim 19 

Appellants' dependent claim 19 adds to claim 18 the limitations of "the network file 
server computer checking a read-in-progress flag for a file block upon finding that the file block 
is not in the file system cache, and upon finding that the read-in-progress flag indicates that a 
prior read of the file block is in progress from the file system in the disk storage, waiting for 
completion of the prior read of the file block from the file system in the disk storage, and then 
again checking whether the file block is in the file system cache." Pages 14-15 of the Official
Action again cites Marcotte column 12 line 7 through column 13, line 32. However, the use of a 
read-in-progress flag as specifically recited in appellants' claim 19 is not disclosed in Marcotte 
column 12 line 7 through column 13, line 32. See the discussion of Marcotte below with 
reference to claim 25. 

Claim 20 

Appellants' dependent claim 20 adds to claim 18 the limitations of "the network file 
server computer setting a read-in-progress flag for a file block upon finding that the file block is 
not in the file system cache and then beginning to read the file block from the file system in disk
storage, clearing the read-in-progress flag upon writing to the file block on disk, and inspecting
the read-in-progress flag to determine whether the file block has become stale due to a concurrent
write to the file block." Page 15 of the Official Action again cites Marcotte column 12 line 7
through column 13, line 32. However, use of a read-in-progress flag as specifically recited in 
appellants' claim 20 is not disclosed in Marcotte column 12 line 7 through column 13, line 32. 
See the discussion of Marcotte below with reference to claim 25. 

Claim 21 

Appellants' dependent claim 21 adds to claim 18 the limitations of "the network file 
server computer maintaining a generation count for each read of a file block from the file system 
in the disk storage in response to a read request for a file block that is not in the file system 
cache, and checking whether a file block having been read from the file system in the disk
storage has become stale by checking whether the generation count for the file block having been
read from the file system is the same as the generation count for the last read request for the
same file block." Page 15 of the Official Action again cites Marcotte column 12 line 7 through 
column 13, line 32. However, the maintenance of a generation count as specifically recited in 
claim 21 is not disclosed in Marcotte column 12 line 7 through column 13, line 32. See the 
discussion of Marcotte below with reference to claim 25. 

Claims 25 and 51 

With reference to appellants' independent claim 25, page 17 of the Official Action 
recognizes that Burns fails to explicitly recite the network file server responding to concurrent
write requests by writing new data for specified blocks of the file to disk storage without writing
the new data for the specified blocks of the file to the file system cache, and invalidating the
specified blocks of the file in the file system cache. (See, e.g., appellants' FIG. 10, steps 515 
and 516; appellants' spec, page 23 lines 19-23 and page 24 lines 7-15.) Page 17 of the Official 
Action further recognizes that Burns fails to explicitly recite the network file server responding 
to read requests for file blocks not found in the file system cache by reading the file blocks from 
the file system in disk storage and then checking whether the file blocks have become stale due 
to concurrent writes to the file blocks, and writing to the file system cache a file block that has 
not become stale, and not writing to the file system cache a file block that has become stale.
(See, e.g., appellants' FIG. 10, steps 92, 97, 513, 98; appellants' spec, page 23 lines 5-17; page 
24 lines 16-22.) Page 18 of the Official Action cites Marcotte col. 12 line 7 through column 13,
line 32 for all of these limitations not explicitly recited in Burns. However, it is not understood 
how all of the appellants' specific claim limitations are disclosed in Marcotte column 12 line 7 
through column 13, line 32. 

Marcotte column 12 line 7 through column 13, line 32, deals generally with maintaining a
list of lock waiters. As shown in Marcotte FIG. 9, when a task waits for a lock, it queues its
WAIT_ELEMENT to the lock using lock or wait routine 175 and also adds its WAIT_ELEMENT
to a global list or queue 230 before it waits, and removes it from the global list 230 after it waits.
As shown in Marcotte FIG. 11, do wait 240 is executed each time a thread needs to go into a
wait for a lock. Step 241 executes a lock UpdateResource procedure. Step 242 waits on ecb;
and upon receiving it, step 243 executes the lock UpdateResource procedure. Marcotte says
that in this way, the task waiting on a lock 100 will only actually suspend itself on the WAIT
call. Adding and removing tasks (WAIT_ELEMENTs) from global list 230 is done in a manner
guaranteed to make the calling task wait. FIG. 12 shows that an add_to_global routine 245
(with waitp=arg) includes step 246 which determines if prevent flag is null; if so, step 248 posts
an error code in the ecb field 108 of the WAIT_ELEMENT being processed; and, if not, step 247
adds the WAIT_ELEMENT (in this case, task 232) to the head of global list 230. FIG. 13 shows
a remove_from_global routine 250 (with waitp=arg) including step 251 which determines if
prevent flag 124 is null. If so, return code (rc) is set to zero; and if not, this WAIT_ELEMENT
(say, 233) is removed from global list 230. In step 254, the return code (rc) is returned to the
caller. FIG. 14 shows a return_wait routine 260 (with waitp=prc) including step 261 which
determines if waitp is null. If not, the WAIT_ELEMENT pointed to by waitp is returned to free
storage. Return_wait 260 is the post processing routine for remove_from_global 250, and
remove_from_global communicates the address of the WAIT_ELEMENT via its return code,
that is input to return_wait (automatically by the resource update facility) as prc. Return_wait
260 returns the WAIT_ELEMENT to free storage. Since routine 260 is a post processing routine,
the free storage return is NOT performed while holding the lock. This shows the benefits of a
post processing routine, and passing the return value from a resource update routine to the post
processing routine. FIG. 15 shows quiesce global 265 wakes up all waiters and in step 267 tells
them that the program is terminating due to error by way of prevent flag 124 being set to 1 in
step 266. In step 268 pointers 231 and 234 are cleared so global list 230 is empty.

Thus, Marcotte's general teaching of a way of maintaining a list of lock waiters fails to 
disclose the appellants' specific limitations of the network file server responding to concurrent
write requests by writing new data for specified blocks of the file to disk storage without writing
the new data for the specified blocks of the file to the file system cache, invalidating the
specified blocks of the file in the file system cache, and responding to read requests for file
blocks not found in the file system cache by reading the file blocks from the file system in disk
storage and then checking whether the file blocks have become stale due to concurrent writes to
the file blocks, and writing to the file system cache a file block that has not become stale, and not
writing to the file system cache a file block that has become stale. 
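
For illustration only, the cache behavior recited in appellants' claim 25 can be sketched as follows: writes go to disk storage without going to the file system cache and invalidate the cached block, and a block read from disk is placed in the cache only if it has not become stale. This is a sketch under stated assumptions. The class name Server, the racing_write parameter (which simulates a concurrent write arriving during the disk read), and the per-block write counter used to detect staleness are all hypothetical; the counter is a simplification chosen for brevity and is not the read-in-progress flag or generation count recited in the dependent claims.

```python
class Server:
    """Hypothetical model of disk storage plus a file system cache."""

    def __init__(self):
        self.disk = {}     # block number -> data on disk
        self.cache = {}    # block number -> cached data
        self.writes = {}   # block number -> count of writes (assumed mechanism)

    def write(self, block_no, data):
        # Write the new data to disk storage without writing it to the cache,
        # then invalidate the block in the cache.
        self.disk[block_no] = data
        self.writes[block_no] = self.writes.get(block_no, 0) + 1
        self.cache.pop(block_no, None)

    def read(self, block_no, racing_write=None):
        if block_no in self.cache:
            return self.cache[block_no]
        seen = self.writes.get(block_no, 0)  # state when the disk read begins
        data = self.disk.get(block_no)
        if racing_write is not None:
            # Simulate a concurrent write that lands during the disk read.
            self.write(block_no, racing_write)
        if self.writes.get(block_no, 0) == seen:
            self.cache[block_no] = data      # not stale: cache it
        # A stale block is returned to the reader but is not cached.
        return data


if __name__ == "__main__":
    s = Server()
    s.write(1, b"old")
    s.read(1)                          # caches b"old"
    s.write(1, b"new")                 # bypasses and invalidates the cache
    s.read(1, racing_write=b"newer")   # stale read: block 1 is not cached
    print(1 in s.cache)
```

Nothing in the lock-waiter list handling described above corresponds to either branch of this read path.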

"[R]ejections on obviousness grounds cannot be sustained by mere conclusory
statements; instead, there must be some articulated reasoning with some rational underpinning to
support the legal conclusion of obviousness." In re Kahn, 441 F.3d 977, 988 (Fed. Cir. 2006).
A fact finder should be aware of the distortion caused by hindsight bias and must be cautious of
arguments reliant upon ex post reasoning. See KSR International Co. v. Teleflex Inc., 550 U.S. ___,
82 USPQ2d 1385 (2007), citing Graham, 383 U.S. at 36 (warning against a "temptation to read
into the prior art the teachings of the invention in issue" and instructing courts to "guard against
slipping into the use of hindsight").

Claims 26 and 52

Appellants' dependent claim 26 adds to claim 25 the limitations of "the network file
server computer checking a read-in-progress flag for a file block upon finding that the file block
is not in the file system cache, and upon finding that the read-in-progress flag indicates that a
prior read of the file block is in progress from the file system in the disk storage, waiting for
completion of the prior read of the file block from the file system in the disk storage, and then
again checking whether the file block is in the file system cache." Page 18 of the Official Action
again cites Marcotte column 12 line 7 through column 13, line 32. However, the use of a
read-in-progress flag as specifically recited in appellants' claim 26 is not disclosed in Marcotte column
12 line 7 through column 13, line 32. See the discussion of Marcotte above with reference
to claim 25.

Claims 27 and 53 

Appellants' dependent claim 27 adds to claim 25 the limitations of "the network file 
server computer setting a read-in-progress flag for a file block upon finding that the file block is 
not in the file system cache and then beginning to read the file block from the file system in disk 
storage, clearing the read-in-progress flag upon writing to the file block on disk, and inspecting 
the read-in-progress flag to determine whether the file block has become stale due to a concurrent
write to the file block." Page 19 of the Official Action again cites Marcotte column 12 line 7 
through column 13, line 32. However, the use of a read-in-progress flag as specifically recited in 
appellants' claim 27 is not disclosed in Marcotte column 12 line 7 through column 13, line 32. 
See the discussion of Marcotte above with reference to claim 25.

Claims 28 and 54

Appellants' dependent claim 28 adds to claim 25 the limitations of "the network file
server computer maintaining a generation count for each read of a file block from the file system
in the disk storage in response to a read request for a file block that is not in the file system
cache, and checking whether a file block having been read from the file system in the disk
storage has become stale by checking whether the generation count for the file block having been
read from the file system is the same as the generation count for the last read request for the
same file block." Page 19 of the Official Action again cites Marcotte column 12 line 7 through
column 13, line 32. However, the maintenance of a generation count as specifically recited in
appellants' claim 28 is not disclosed in Marcotte column 12 line 7 through column 13, line 32.
See the discussion of Marcotte above with reference to claim 25.

3. Claims 59 and 60 are patentable under 35 U.S.C. 103(a) over Burns et al. U.S.
Patent 6,925,515 B2 in view of Xu et al. US 6,324,581 B1.

Appellants' dependent claims 59 and 60 are patentable over the proposed combination of
Burns and Xu due to the limitations of their independent base claim 58. Burns is distinguished
from the base claim 58 as discussed above with reference to claim 1. Xu is distinguished from
the base claim 58 in a similar fashion. Xu fails to disclose appellants' step c) of releasing the
allocation mutex of the file [prior to the step d) of issuing asynchronous write requests for
writing to the file], and Xu fails to disclose appellants' step f) of obtaining the allocation mutex
for the file [after the step d) of issuing asynchronous write requests for writing to the file].

In view of the above, the rejection of the claims should be reversed.



Respectfully submitted, 



Richard C. Auchterlonie 
Reg. No. 30,607 

NOVAK DRUCE & QUIGG, LLP 
1000 Louisiana, 53rd Floor
Houston, TX 77002 
713-571-3460 (Telephone) 
713-456-2836 (Telefax) 
Richard.Auchterlonie@novakdruce.

VIII. CLAIMS APPENDIX

The claims involved in this appeal are as follows: 

1. A method of operating a network file server computer for providing clients with
concurrent write access to a file in data storage, the method comprising the network file server
computer responding to a concurrent write request from a client by:

(a) obtaining a lock for the file; and then 

(b) preallocating a metadata block for the file; and then 

(c) releasing the lock for the file; and then 

(d) asynchronously writing to the file; and then 

(e) obtaining the lock for the file; and then 

(f) committing the metadata block to the file in the data storage; and then 

(g) releasing the lock for the file.

2. The method as claimed in claim 1, wherein the file further includes a hierarchy of blocks 
including an inode block of metadata, data blocks of file data, and indirect blocks of metadata, 
and wherein the metadata block for the file is an indirect block of metadata.

3. The method as claimed in claim 2, which further includes copying data from an original
indirect block of the file to the metadata block for the file, the original indirect block of the file 
having been shared between the file and a read-only version of the file. 

4. The method as claimed in claim 1, which further includes concurrent writing for more
than one client to the metadata block for the file. 

5. The method as claimed in claim 1, wherein the asynchronous writing to the file includes a 
partial write to a new block that has been copied at least in part from an original block of the file, 
and wherein the method further includes checking a partial block conflict queue for a conflict 
with a concurrent write to the new block, and upon failing to find an indication of a conflict with 
a concurrent write to the new block, preallocating the new block, copying at least a portion of the 
original block of the file to the new block, and performing the partial write to the new block. 

6. The method as claimed in claim 1, wherein the asynchronous writing to the file includes a 
partial write to a new block that has been copied at least in part from an original block of the file, 
and wherein the method further includes checking a partial block conflict queue for a conflict 
with a concurrent write to the new block, and upon finding an indication of a conflict with a 
concurrent write to the new block, waiting until resolution of the conflict with the concurrent 
write to the new block, and then performing the partial write to the new block.

7. The method as claimed in claim 6, which further includes placing a request for the partial
write in a partial write wait queue upon finding an indication of a conflict with a concurrent write 
to the new block, and performing the partial write upon servicing the partial write wait queue. 

8. The method as claimed in claim 1, which further includes checking an input-output list 
for a conflicting prior concurrent access to the file, and upon finding a conflicting prior
concurrent access to the file, suspending the asynchronous writing to the file until the conflicting
prior concurrent access to the file is no longer conflicting. 

9. The method as claimed in claim 8, which further includes providing a sector-level 
granularity of byte range locking for concurrent write access to the file by the suspending of the 
asynchronous writing to the file until the conflicting prior concurrent access is no longer 
conflicting. 

10. The method as claimed in claim 1, which further includes writing the metadata block to 
a log in storage of the network file server computer for committing the metadata block for the 
file. 

11. The method as claimed in claim 1, which further includes gathering together preallocated 
metadata blocks for a plurality of client write requests to the file, and committing together the 
preallocated metadata blocks for the plurality of client write requests to the file by obtaining the
lock for the file, committing the gathered preallocated metadata blocks for the plurality of client
write requests to the file, and then releasing the lock for the file. 

12. The method as claimed in claim 1, which further includes checking whether a previous 
commit is in progress after asynchronously writing to the file and before obtaining the lock for 
the file for committing the metadata block to the file, and upon finding that a previous commit is 
in progress, placing a request for committing the metadata block to the file on a staging queue for 
the file. 

13. A method of operating a network file server computer for providing clients with 
concurrent write access to a file in data storage, the method comprising the network file server 
computer responding to a concurrent write request from a client by:

(a) preallocating a block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the block to the file in the data storage; 

wherein the asynchronous writing to the file includes a partial write to a new block that 
has been copied at least in part from an original block of the file, and wherein the method 
includes checking a partial block conflict queue for a conflict with a concurrent write to the new 
block, and upon finding an indication of a conflict with a concurrent write to the new block,
waiting until resolution of the conflict with the concurrent write to the new block, and then 
performing the partial write to the new block.

14. The method as claimed in claim 13, wherein the method further includes placing a
request for the partial write in a partial write wait queue upon finding an indication of a conflict
with a concurrent write to the new block, and performing the partial write upon servicing the 
partial write wait queue. 

15. A method of operating a network file server computer for providing clients with 
concurrent write access to a file, the method comprising the network file server computer 
responding to a concurrent write request from a client by:

(a) preallocating a metadata block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the metadata block to the file in the data storage; 

wherein the method includes gathering together preallocated metadata blocks for a 
plurality of client write requests to the file, and committing together the preallocated metadata 
blocks for the plurality of client write requests to the file by obtaining a lock for the file, 
committing the gathered preallocated metadata blocks for the plurality of client write requests to 
the file, and then releasing the lock for the file. 

16. The method as claimed in claim 15, which further includes checking whether a previous
commit is in progress after asynchronously writing to the file and before obtaining the lock for
the file for committing the block to the file, and upon finding that a previous commit is in
progress, placing a request for committing the metadata block to the file on a staging queue for 
the file. 

17. The method as claimed in claim 15, wherein the network file server computer includes 
disk storage containing a file system, and a file system cache storing data of blocks of the file, 
and the method further includes the network file server computer responding to concurrent write 
requests by writing new data for specified blocks of the file to the disk storage without writing 
the new data for the specified blocks of the file to the file system cache, and invalidating the 
specified blocks of the file in the file system cache. 

18. The method as claimed in claim 17, which further includes the network file server 
computer responding to read requests for file blocks not found in the file system cache by 
reading the file blocks from the file system in disk storage and then checking whether the file 
blocks have become stale due to concurrent writes to the file blocks, and writing to the file 
system cache a file block that has not become stale, and not writing to the file system cache a file 
block that has become stale. 

19. The method as claimed in claim 18, which further includes the network file server 
computer checking a read-in-progress flag for a file block upon finding that the file block is not 
in the file system cache, and upon finding that the read-in-progress flag indicates that a prior read
of the file block is in progress from the file system in the disk storage, waiting for completion of
the prior read of the file block from the file system in the disk storage, and then again checking 
whether the file block is in the file system cache. 

20. The method as claimed in claim 18, which further includes the network file server
computer setting a read-in-progress flag for a file block upon finding that the file block is not in 
the file system cache and then beginning to read the file block from the file system in disk 
storage, clearing the read-in-progress flag upon writing to the file block on disk, and inspecting 
the read-in-progress flag to determine whether the file block has become stale due to a 
concurrent write to the file block. 

21. The method as claimed in claim 18, which further includes the network file server
computer maintaining a generation count for each read of a file block from the file system in the 
disk storage in response to a read request for a file block that is not in the file system cache, and 
checking whether a file block having been read from the file system in the disk storage has 
become stale by checking whether the generation count for the file block having been read from 
the file system is the same as the generation count for the last read request for the same file 
block. 

22. The method as claimed in claim 15, which further includes processing multiple 
concurrent read and write requests by pipelining the requests through a first processor and a
second processor, the first processor performing metadata management for the multiple
concurrent read and write requests, and the second processor performing asynchronous reads and 
writes for the multiple concurrent read and write requests. 

23. The method as claimed in claim 15, which further includes serializing the reads by 
delaying access for each read to a block that is being written to by a prior, in-progress write until 
completion of the write to the block that is being written to by the prior, in-progress write. 

24. The method as claimed in claim 15, which further includes serializing the writes by 
delaying access for each write to a block that is being accessed by a prior, in-progress read or 
write until completion of the read or write to the block that is being accessed by the prior, in- 
progress read or write. 

25. A method of operating a network file server computer for providing clients with 
concurrent read and write access to a file in data storage, the method comprising the network file 
server computer responding to a concurrent write request from a client by:

(a) preallocating a metadata block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the metadata block to the file in the data storage; 

wherein the network file server computer includes disk storage containing a file system, 
and a file system cache storing data of blocks of the file, and the method includes the network
file server computer responding to concurrent write requests by writing new data for specified
blocks of the file to the disk storage without writing the new data for the specified blocks of the
file to the file system cache, and invalidating the specified blocks of the file in the file system
cache, and

which includes the network file server computer responding to read requests for file 
blocks not found in the file system cache by reading the file blocks from the file system in disk
storage and then checking whether the file blocks have become stale due to concurrent writes to 
the file blocks, and writing to the file system cache a file block that has not become stale, and not 
writing to the file system cache a file block that has become stale. 

26. The method as claimed in claim 25, which further includes the network file server 
computer checking a read-in-progress flag for a file block upon finding that the file block is not
in the file system cache, and upon finding that the read-in-progress flag indicates that a prior read
of the file block is in progress from the file system in the disk storage, waiting for completion of 
the prior read of the file block, and then again checking whether the file block is in the file 
system cache. 

27. The method as claimed in claim 25, which further includes the network file server
computer setting a read-in-progress flag for a file block upon finding that the file block is not in
the file system cache and then beginning to read the file block from the file system in disk
storage, clearing the read-in-progress flag upon writing to the file block on disk, and inspecting
the read-in-progress flag to determine whether the file block has become stale due to a
concurrent write to the file block. 

28. The method as claimed in claim 25, which further includes the network file server 
computer maintaining a generation count for each read of a file block from the file system in the 
disk storage in response to a read request for a file block that is not in the file system cache, and 
checking whether a file block having been read from the file system in the disk storage has 
become stale by checking whether the generation count for the file block having been read from 
the file system is the same as the generation count for the last read request for the same file 
block. 

32. A method of operating a network file server computer for providing clients with
concurrent write access to a file in data storage, the method comprising the network file server
computer responding to a concurrent write request from a client by executing a write thread, 
execution of the write thread including: 

(a) obtaining an allocation mutex for the file; and then 

(b) preallocating new metadata blocks that need to be allocated for writing to the file; and then

(c) releasing the allocation mutex for the file; and then 

(d) issuing asynchronous write requests for writing to the file;

(e) waiting for callbacks indicating completion of the asynchronous write requests; and then

(f) obtaining the allocation mutex for the file; and then 

(g) committing the preallocated metadata blocks to the file in the data storage; and then 

(h) releasing the allocation mutex for the file. 

33. A network file server comprising storage for storing a file, and at least one processor 
coupled to the storage for providing clients with concurrent write access to the file, wherein the 
network file server is programmed for responding to a concurrent write request from a client by: 

(a) obtaining a lock for the file; and then 

(b) preallocating a metadata block for the file; and then 

(c) releasing the lock for the file; and then 

(d) asynchronously writing to the file; and then 

(e) obtaining the lock for the file; and then 

(f) committing the metadata block to the file; and then 

(g) releasing the lock for the file. 

34. The network file server as claimed in claim 33, wherein the file further includes a 
hierarchy of blocks including an inode block of metadata, data blocks of file data, and indirect 
blocks of metadata, and wherein the metadata block for the file is an indirect block of metadata.

35. The network file server as claimed in claim 34, which is further programmed for copying
data from an original indirect block of the file to the metadata block for the file, the original
indirect block of the file having been shared between the file and a read-only version of the file. 

36. The network file server as claimed in claim 33, which is further programmed for 
concurrent writing for more than one client to the metadata block for the file. 

37. The network file server as claimed in claim 33, which further includes a partial block 
conflict queue for indicating a concurrent write to a new block that is being copied at least in part 
from an original block of the file, and wherein the network file server is further programmed to
respond to a client request for a partial write to the new block by checking the partial block 
conflict queue for a conflict, and upon failing to find an indication of a conflict, preallocating the 
new block, copying at least a portion of the original block of the file to the new block, and 
performing a partial write to the new block. 

38. The network file server as claimed in claim 33, which further includes a partial block 
conflict queue for indicating a concurrent write to a new block that is being copied at least in part 
from an original block of the file, and wherein the network file server is further programmed to 
respond to a client request for a partial write to the new block by checking the partial block 
conflict queue for a conflict, and upon finding an indication of a conflict, waiting until resolution 
of the conflict with the concurrent write to the new block, and then performing the partial write 
to the new block.

39. The network file server as claimed in claim 38, which further includes a partial write wait
queue, and wherein the network file server is further programmed for placing a request for the
partial write in the partial write wait queue upon finding an indication of a conflict, and
performing the partial write upon servicing the partial write wait queue.

40. The network file server as claimed in claim 33, which is further programmed for 
maintaining an input-output list of concurrent reads and writes to the file, and when writing to 
the file, for checking the input-output list for a conflicting prior concurrent read or write access 
to the file, and upon finding a conflicting prior concurrent read or write access to the file, 
suspending the asynchronous writing to the file until the conflicting prior concurrent read or 
write access to the file is no longer conflicting.

41. The network file server as claimed in claim 40, which is further programmed so that the 
suspending of the asynchronous writing to the file until the conflicting prior concurrent read or 
write access to the file is no longer conflicting provides a sector-level granularity of byte range
locking for concurrent write access to the file. 
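
For illustration only, and not as part of the claim language, the check of an input-output list at sector-level granularity recited in claims 40 and 41 can be sketched as follows. The 512-byte sector size, the dictionary layout of a list entry, and the function names are assumptions made for this sketch and are not taken from the specification.

```python
SECTOR = 512  # assumed sector size in bytes


def sector_range(offset, length):
    """Return the half-open range of sectors touched by a byte range."""
    first = offset // SECTOR
    last = (offset + length - 1) // SECTOR
    return first, last + 1


def conflicts(io_list, offset, length, is_write):
    """Return the prior accesses that a new access would have to wait for."""
    lo, hi = sector_range(offset, length)
    blocking = []
    for prior in io_list:
        p_lo, p_hi = sector_range(prior["offset"], prior["length"])
        overlap = lo < p_hi and p_lo < hi
        # Two reads never conflict; any overlap that involves a write does.
        if overlap and (is_write or prior["is_write"]):
            blocking.append(prior)
    return blocking
```

A new access would be suspended while conflicts(...) returns a non-empty list and resumed once the listed prior accesses have completed. Because byte ranges are rounded out to whole sectors, two writes to different bytes of the same sector are treated as conflicting.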

42. The network file server as claimed in claim 33, which is further programmed for 
maintaining an input-output list of concurrent reads and writes to the file, and when reading from 
the file, for checking the input-output list for a conflicting prior concurrent write access to the
file, and upon finding a conflicting prior concurrent write access to the file, suspending the
reading to the file until the conflicting prior concurrent write access to the file is no longer
conflicting.

43. The network file server as claimed in claim 42, which is further programmed so that the 
suspending of the reading to the file until the conflicting prior concurrent write access to the file
is no longer conflicting provides a sector-level granularity of byte range locking for concurrent
read access to the file. 

44. The network file server as claimed in claim 33, which is further programmed for 
committing the metadata block for the file by writing the metadata block to a log in the storage. 

45. The network file server as claimed in claim 33, which is further programmed for 
gathering together preallocated metadata blocks for a plurality of client requests for write access 
to the file, and committing together the preallocated metadata blocks for the plurality of client 
requests for access to the file by obtaining the lock for the file, committing the gathered 
preallocated metadata blocks for the plurality of client requests for write access to the file, and 
then releasing the lock for the file. 

46. The network file server as claimed in claim 33, which further includes a staging queue for 
the file, and which is further programmed for checking whether a previous commit is in progress
after asynchronously writing to the file and before obtaining the lock for the file for committing 
the metadata block to the file, and upon finding that a previous commit is in progress, placing a 
request for committing the metadata block to the file on the staging queue for the file. 

47. A network file server comprising storage for storing a file, and at least one processor 
coupled to the storage for providing clients with concurrent write access to the file, wherein the 
network file server is programmed for responding to a concurrent write request from a client by:

(a) preallocating a block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the block to the file; 

wherein the network file server includes a partial block conflict queue for indicating a 
concurrent write to a new block that is being copied at least in part from an original block of the
file, and wherein the network file server is programmed for responding to a client request for a 
partial write to the new block by checking the partial block conflict queue for a conflict, and 
upon finding an indication of a conflict, waiting until resolution of the conflict with the 
concurrent write to the new block of the file, and then performing the partial write to the new 
block of the file. 

48. The network file server as claimed in claim 47, which further includes a partial write wait
queue, and wherein the network file server is programmed for placing a request for the partial
write in the partial write wait queue upon finding an indication of a conflict, and performing the
partial write upon servicing the partial write wait queue. 

49. A network file server comprising storage for storing a file, and at least one processor 
coupled to the storage for providing clients with concurrent write access to the file, wherein the 
network file server is programmed for responding to a concurrent write request from a client by: 

(a) preallocating a metadata block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the metadata block to the file; 

wherein the network file server is programmed for gathering together preallocated 
metadata blocks for a plurality of client write requests to the file, and committing together the 
preallocated metadata blocks for the plurality of client write requests to the file by obtaining a 
lock for the file, committing the gathered preallocated metadata blocks for the plurality of client 
write requests to the file, and then releasing the lock for the file. 

50. The network file server as claimed in claim 49, which is further programmed for
checking whether a previous commit is in progress after asynchronously writing to the file and 
before obtaining the lock for the file for committing the metadata block to the file, and upon 
finding that a previous commit is in progress, placing a request for committing the metadata 
block to the file on a staging queue for the file.

51. A network file server comprising disk storage containing a file system, and a file system
cache storing data of blocks of a file in the file system, wherein the network file server is
programmed for responding to a concurrent write request from a client by:

(a) preallocating a metadata block for the file; and then 

(b) asynchronously writing to the file; and then 

(c) committing the metadata block to the file; 

wherein the network file server is further programmed for responding to concurrent write 
requests by writing new data for specified blocks of the file to the disk storage without writing 
the new data for the specified blocks of the file to the file system cache, and invalidating the 
specified blocks of the file in the file system cache, and 

wherein the network file server is programmed for responding to concurrent read requests 
for file blocks not found in the file system cache by reading the file blocks from the file system
in disk storage and then checking whether the file blocks have become stale due to concurrent 
writes to the file blocks, and writing to the file system cache a file block that has not become 
stale, and not writing to the file system cache a file block that has become stale. 

52. The network file server as claimed in claim 51, which is further programmed for 
checking a read-in-progress flag for a file block upon finding that the file block is not in the file
system cache, and upon finding that the read-in-progress flag indicates that a prior read of the
file block is in progress from the file system in the disk storage, waiting for completion of the
prior read of the file block, and then again checking whether the file block is in the file system
cache. 

53. The network file server as claimed in claim 51, which is further programmed for setting a 
read-in-progress flag for a file block upon finding that the file block is not in the file system 
cache and then beginning to read the file block from the file system in disk storage, clearing the 
read-in-progress flag upon writing to the file block on disk, and inspecting the read-in-progress 
flag to determine whether the file block has become stale due to a concurrent write to the file 
block. 

54. The network file server as claimed in claim 51, which is further programmed for
maintaining a generation count for each read of a file block from the file system in the disk 
storage in response to a read request for a file block that is not in the file system cache, and 
checking whether a file block having been read from the file system in the disk storage has 
become stale by checking whether the generation count for the file block having been read from 
the file system is the same as the generation count for the last read request for the same file 
block. 
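
For illustration only, and not as part of the claim language, the generation-count staleness check recited in claim 54 can be sketched as follows. The class BlockCache and its method names are hypothetical. In particular, the note_write method, which advances the count when a concurrent write occurs, is an assumption of this sketch about how a write could make an in-flight read stale; the claim itself recites only the comparison of generation counts.

```python
class BlockCache:
    """Hypothetical file system cache keyed by block number."""

    def __init__(self):
        self.cache = {}        # block number -> cached data
        self.generation = {}   # block number -> count of the latest request

    def begin_read(self, block):
        """Start a disk read after a cache miss; return its generation count."""
        self.generation[block] = self.generation.get(block, 0) + 1
        return self.generation[block]

    def note_write(self, block):
        """Assumption: a concurrent write advances the count and drops the entry."""
        self.generation[block] = self.generation.get(block, 0) + 1
        self.cache.pop(block, None)

    def finish_read(self, block, ticket, data):
        """Cache the data only if no later request has superseded this read."""
        if ticket == self.generation.get(block):
            self.cache[block] = data
            return True
        return False  # stale: the data is not written to the cache
```

A read whose count no longer matches the current count for the block still returns its data to the requester in this sketch's model, but the data is kept out of the cache.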

58. A network file server comprising storage for storing a file, and at least one processor 
coupled to the storage for providing clients with concurrent write access to the file, wherein the
network file server is programmed with a write thread for responding to a concurrent write
request from a client by: 

(a) obtaining an allocation mutex for the file; and then 

(b) preallocating new metadata blocks that need to be allocated for writing to the file; and then

(c) releasing the allocation mutex for the file; and then 

(d) issuing asynchronous write requests for writing to the file; 

(e) waiting for callbacks indicating completion of the asynchronous write requests; and then

(f) obtaining the allocation mutex for the file; and then 

(g) committing the preallocated metadata blocks; and then 

(h) releasing the allocation mutex for the file. 

59. The network file server as claimed in claim 58, which further includes an uncached write 
interface, a file system cache and a cached read-write interface, and wherein the uncached write 
interface bypasses the file system cache for sector-aligned write operations. 

60. The network file server as claimed in claim 59, wherein the network file server is further 
programmed to invalidate cache blocks in the file system cache including sectors being written to 
by the uncached write interface.

61. A network file server comprising storage for storing a file, and at least one processor
coupled to the storage for providing clients with concurrent write access to the file, wherein the
network file server is programmed for responding to a concurrent write request from a client by:

(a) preallocating a block for writing to the file; 

(b) asynchronously writing to the file; and then 

(c) committing the preallocated block; 

wherein the network file server also includes an uncached write interface, a file system cache, 
and a cached read-write interface, wherein the uncached write interface bypasses the file system 
cache for sector-aligned write operations, and the network file server is programmed to 
invalidate cache blocks in the file system cache including sectors being written to by the 
uncached write interface. 

62. The method as claimed in claim 1, which further includes a final step of returning to said 
client an acknowledgement of the writing to the file. 

63. The method as claimed in claim 13, which further includes a final step of returning to 
said client an acknowledgement of the writing to the file. 

64. The method as claimed in claim 15, which further includes a final step of returning to 
said client an acknowledgement of the writing to the file.

65. The method as claimed in claim 25, which further includes a final step of returning to
said client an acknowledgement of the writing to the file. 

67. The method as claimed in claim 32, which further includes a final step of returning to
said client an acknowledgement of the writing to the file. 

68. The method as claimed in claim 1, which further includes a final step of saving the file in 
disk storage of the network file server. 

69. The method as claimed in claim 13, which further includes a final step of saving the file 
in disk storage of the network file server. 

70. The method as claimed in claim 15, which further includes a final step of saving the file
in disk storage of the network file server.

71. The method as claimed in claim 25, which further includes a final step of saving the file 
in the disk storage. 



73. The method as claimed in claim 32, which further includes a final step of saving the file 
in disk storage of the network file server.

IX. EVIDENCE APPENDIX

Attached are copies of the following evidence: 

1. Chang et al., U.S. Patent Application Publication US 2005/0039049 A1 published
Feb. 17, 2005, cited by the Examiner as Ref A on the Notice of References Cited PTO-892
attached to the Official Action dated Jan. 13, 2009.

2. Row et al. U.S. Patent 5,163,131 issued Nov. 10, 1992, cited by the Appellants as Ref
1 in the Information Disclosure Statement filed April 15, 2009. 

3. Hitz et al. U.S. Patent 6,065,037 issued May 16, 2000, cited by the Appellants as Ref 
6 in the Information Disclosure Statement filed April 15, 2009.

(57) ABSTRACT

A method and apparatus for a multiple concurrent writer file 
system are provided. With the method and apparatus, the 
metadata of a file includes a read lock, a write lock and a 
concurrent writer flag. If the concurrent writer flag is set, the 
file allows for multiple writers. That is, multiple processes 
may write to the same block of data within the file at 
approximately the same time as long as they are not chang- 
ing the allocation of the block of data, i.e. either allocating 
the block, deallocating the block of data, or changing the 
size of the block of data. Multiple writers is facilitated by 
allowing processes performing write operations that do not 
require or result in a change to the allocation of data blocks 
in a file to use the read lock of a file rather than the write lock 
of the file. Software serialization or integrity mechanisms
may be used to govern the manner by which these concur-
rent write operations have their results reflected in the file
structure. Those processes performing write operations that
do require or result in a change in the allocation of data 
blocks in a file must still acquire the write lock before 
performing their operation.

METHOD AND APPARATUS FOR A MULTIPLE
CONCURRENT WRITER FILE SYSTEM 

BACKGROUND OF THE INVENTION 
[0001] 1. Technical Field 

[0002] The present invention is generally directed to an 
improved file system for a data processing system. More 
specifically, the present invention is directed to a local file 
system that permits multiple concurrent readers and writers. 
[0003] 2. Description of Related Art 

[0004] A file system is a computer program that allows 
other application programs to store and retrieve data on 
media such as disk drives. A file is a named collection of 
related information that is recorded on a storage medium, 
e.g., a magnetic disk. The file system allows application 
programs to create files, give them names, store (or write) 
data into them, to read data from them, delete them, and 
perform other operations on them. In general, a file structure 
is the organization of data on the disk drives. In addition to 
the file data itself, the file structure contains metadata: a 
directory that maps file names to the corresponding files, file 
metadata that contains information about the file, most 
importantly the location of the file data on the disk (i.e. 
which disk blocks hold the file data), an allocation map that 
records which disk blocks are currently in use to store 
metadata and file data, and a superblock that contains overall 
information about the file structure (e.g., the locations of the 
directory, allocation map, and other metadata structures). 

[0005] File systems may be localized, such as a file system 
for a particular computing device, or distributed such that a 
plurality of computing devices have access to shared stor- 
age, e.g., a shared disk file system. In both cases, it is 
important to ensure the integrity of the file structure 
accessed by the file system so that corruption of data is not 
permitted. This is typically performed by governing the 
computing devices and/or applications that may read or 
write to the files of the file structure. 

[0006] Consider a file structure stored on N disks, D0, D1,
. . . , DN-1. Each disk block in the file structure is identified
by a pair (i,j), e.g., (5, 254) identifies the 254th block on disk
D5. The allocation map is typically stored in an array A, 
where the value of element A(i,j) denotes the allocation state 
(allocated/free) of disk block (i,j). 

[0007] The allocation map is typically stored on disk as 
part of the file structure, residing in one or more disk blocks. 
Conventionally, A(i,j) is the kth sequential element in the 
map, where k=iM+j, and M is some constant greater than the 
largest block number on any disk. 
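
As a worked illustration of the mapping k=iM+j described above, the following Python sketch converts between a block pair (i,j) and its sequential position k in the map. The function names and the value M=1000 used in the usage note are assumptions chosen only for this example.

```python
def map_index(i, j, M):
    """Return k = i*M + j, the position of A(i,j) in the allocation map."""
    if not 0 <= j < M:
        # M must exceed the largest block number so that each (i, j) is unique.
        raise ValueError("block number j must satisfy 0 <= j < M")
    return i * M + j


def block_of(k, M):
    """Invert the mapping: recover the pair (i, j) from the map position k."""
    return divmod(k, M)
```

With M=1000, the block (5, 254) from the example above maps to k=5254, and block_of(5254, 1000) recovers (5, 254). The requirement that M be greater than the largest block number is what makes the mapping invertible.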

[0008] To find a free block of disk space, the file system 
reads a block of A into a memory buffer and searches the 
buffer to find an element A(i,j) whose value indicates that
the corresponding block (i,j) is free. Before using block (i,j),
the file system updates the value of A(i,j) in the buffer to
indicate that the state of the block (i,j) is allocated, and
writes the buffer back to disk. To free a block (i,j) that is no
longer needed, the file system reads the block containing A(i,j)
into a buffer, updates the value of A(i,j) to denote that block 
(i,j) is free, and writes the block from the buffer back to disk. 
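
The read, update, and write-back cycle described above can be sketched as follows. This is a minimal single-process illustration, assuming that one block of the allocation map is modeled as a Python list of booleans (True meaning allocated); the names allocate_block and free_block are hypothetical.

```python
def allocate_block(disk_map):
    """Find a free element, mark it allocated, and write the map back.

    Returns the index of the allocated element, or None if none is free.
    """
    buffer = list(disk_map)            # read the map block into a memory buffer
    for k, allocated in enumerate(buffer):
        if not allocated:
            buffer[k] = True           # mark the block as allocated in the buffer
            disk_map[:] = buffer       # write the buffer back to "disk"
            return k
    return None


def free_block(disk_map, k):
    """Mark element k as free and write the map back."""
    buffer = list(disk_map)            # read the block containing the element
    buffer[k] = False                  # denote that the block is free
    disk_map[:] = buffer               # write the buffer back to "disk"
```

Nothing in this sketch synchronizes two callers that work on separate buffered copies of the same map block.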
[0009] If the nodes comprising a shared disk file system,
or a plurality of applications on a single computing device,
do not properly synchronize their access to the shared
storage, they may corrupt the file structure. This applies in 
particular to the allocation map. To illustrate this, consider 
the process of allocating a free block described above. 
Suppose two nodes simultaneously attempt to allocate a 
block. In the process of doing this, they could both read the 
same allocation map block, both find the same element A(i,j)
describing free block (i,j), both update A(i,j) to show block
(i,j) as allocated, both write the block back to disk, and both 
proceed to use block (i,j) for different purposes, thus vio- 
lating the integrity of the file structure. 

[0010] A more subtle but just as serious problem occurs 
even if the nodes simultaneously allocate different blocks X 
and Y, if A(X) and A(Y) are both contained in the same map
block. In this case, the first node sets A(X) to allocated, the 
second node sets A(Y) to allocated, and both simultaneously 
write their buffered copies of the map block to disk. Depend- 
ing on which write is done first, either block X or Y will 
appear free in the map on the disk. If, for example, the 
second node's write is executed after the first node's write, 
block X will be free in the map on disk. The first node will 
proceed to use block X (e.g., to store a data block on a file), 
but at some time later another node could allocate block X 
for some other purpose, again with the result of violating the 
integrity of the file structure. 
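
The lost-update scenario just described can be replayed deterministically. The sketch below is illustrative only: the map block is modeled as a dictionary with two entries, and the function name simulate_lost_update is hypothetical.

```python
def simulate_lost_update():
    """Replay two nodes updating buffered copies of the same map block."""
    disk_map = {"X": False, "Y": False}   # both blocks start out free

    node1 = dict(disk_map)                # each node reads the same map block
    node2 = dict(disk_map)

    node1["X"] = True                     # first node sets A(X) to allocated
    node2["Y"] = True                     # second node sets A(Y) to allocated

    disk_map = dict(node1)                # first node's write reaches the disk
    disk_map = dict(node2)                # second node's write overwrites it

    return disk_map
```

The function returns a map in which Y is allocated and X is free, even though the first node is already using block X. This matches the case in the text where the second node's write is executed after the first node's write.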

[0011] In order to ensure the integrity of the file structure,
many file systems make use of an integrity manager or
concurrency management mechanism that determines how
to govern reads and writes to the storage device. The most 
widely used mechanism is a locking mechanism in which 
processes must obtain a lock on a block of data in order to 
access the block of data. For example, a block of data may 
have a read lock and a write lock. Any number of processes 
may obtain the read lock concurrently and thus, be able to 
read the data in the block at approximately the same time. 
However, only one process may obtain the write lock at any 
one time. Thus, multiple concurrent readers are possible but 
only one writer is permitted at any one time. This ensures 
that two or more processes cannot write to the same block 
of data at the same time, such as in the situation previously 
discussed. 
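
The locking rule described above (any number of concurrent readers, but only one writer at a time) can be sketched as follows. Python's standard library does not provide a reader-writer lock, so this sketch builds a simple one from threading.Condition; the class name ReadWriteLock is hypothetical, and the sketch makes no attempt to prevent writers from being starved by a steady stream of readers.

```python
import threading


class ReadWriteLock:
    """Minimal reader-writer lock: many readers or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # number of processes holding the read lock
        self._writer = False     # True while a process holds the write lock

    def acquire_read(self):
        with self._cond:
            while self._writer:              # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:   # wait for exclusive access
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()          # wake waiting readers and writers
```

Two callers can hold the read lock at once, while a caller of acquire_write blocks until all readers and any other writer have released.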

[0012] Some computer applications also provide for their 
own serialization or locking of blocks of data. For example,
databases typically include integrity management mecha- 
nisms for ensuring that the integrity of the records within the 
database is maintained. These application based integrity 
management mechanisms manage reads and writes to 
records of the database so that the database is not corrupted. 

[0013] An example of such an integrity management 
mechanism is the two-phase commit. In the two-phase 
commit, a prepare phase is followed by a commit phase. In 
the prepare phase, a global coordinator (initiating database) 
requests that all participants (distributed databases) agree to 
commit or rollback a transaction. In the subsequent commit 
phase, all participants respond to the coordinator that they 
are prepared and then the coordinator requests all nodes to 
commit the transaction. If all participants cannot prepare or 
there is a system component failure, the coordinator asks all 
databases to rollback the transaction. 
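
As a rough illustration of the two-phase commit outlined above, the following sketch shows a coordinator that commits only if every participant prepares successfully and otherwise asks all participants to roll back. The Participant class and the function two_phase_commit are hypothetical simplifications; a real implementation would also need durable logging and timeout handling, which are omitted here.

```python
class Participant:
    """Hypothetical participant database that may or may not be able to prepare."""

    def __init__(self, can_prepare):
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.can_prepare else "failed"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Run the prepare phase, then commit everywhere or roll back everywhere."""
    all_prepared = True
    for p in participants:                 # prepare phase
        try:
            ok = p.prepare()
        except Exception:                  # treat a failure as a "no" vote
            ok = False
        if not ok:
            all_prepared = False
            break
    if all_prepared:                       # commit phase
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                 # at least one participant could not prepare
        p.rollback()
    return "rolled back"
```

For example, two_phase_commit([Participant(True), Participant(False)]) returns "rolled back" and leaves both participants in the rolled-back state.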

[0014] In situations where an application, such as a data- 
base, provides for its own serialization or locking, there is no 
need for the file system to limit the number of concurrent
writers to a single writer in order to avoid corruption of the
file structure. In fact, in some situations, the potential speed 
at which the application may execute is impaired by the 
limitations of the file system. Thus, it would be beneficial to 
remove the limitations of the file system with regard to 
concurrent writers when the file in question is associated 
with an application having its own serialization or locking 
mechanisms. 

SUMMARY OF THE INVENTION 

[0015] The present invention provides a method and appa- 
ratus for a multiple concurrent reader/writer file system. 
With the method and apparatus of the present invention, the 
metadata of a file includes a read lock, a write lock, and a 
concurrent writer flag. If the concurrent writer flag is set, the 
file allows for multiple writers. In other words, multiple 
processes may write to the same block of data within the file 
at approximately the same time as long as they are not 
changing the allocation of the block of data, i.e. either 
allocating the block, deallocating the block of data, or 
changing the size of the block of data. 

[0016] With the method and apparatus of the present 
invention, when an access request, e.g., a write or a read 
operation, is received for one or more data blocks of a file, 
a determination is first made as to whether the access request 
is a read request. If the access request is a read request, the 
reader lock of the file is obtained by the process sending the 
access request. Any number of processes may acquire the 
reader lock of a file at approximately the same time such that 
multiple concurrent readers are allowed. 
[0017] If the access request is not a read access request, 
then the access request is determined to be a write access 
request. A determination is made as to whether the file 
permits multiple concurrent writers by determining the value
of the concurrent writer flag in the metadata for the file. If 
the concurrent writer flag is set, then the file permits multiple 
concurrent writers. If the concurrent writer flag is not set, 
then the file does not permit multiple concurrent writers. If 
it is determined that multiple concurrent writers is not 
permitted, i.e. the concurrent writers flag is not set, then the 
process must obtain the writer lock to gain access to the file. 
Only one process may acquire the write lock at a time and 
thus, any subsequent process requesting write access to the 
file and needing to obtain the write lock will spin on the lock 
until it is released by the process that currently has acquired 
it. This also prevents readers from accessing the file. Thus, 
while there is a reader lock writers will spin on the lock and
while there is a writer lock readers will spin on the lock. 

[0018] If the file permits concurrent writers, i.e. the con- 
current writer flag is set, then a determination is made as to 
whether the write access request is a write access request 
that intends to change the allocation of one or more blocks
of the file. That is, if the write access request will result in
a change in the size of the file either by allocating new data 
blocks to the file, deallocating existing blocks in the file, or 
changing the size of the existing blocks. If the write access 
request is one that will require or result in a change to the 
allocation of the data blocks of the file, then the write lock
must be acquired by this process. 

[0019] One situation in which a write access request will
change the allocation of the data blocks of the file is when
a file is extended, i.e. the request is a request to write to an
offset that is greater than the current file size. Another
situation where a write access request will change the
allocation of the data blocks is when the file is truncated.
Both of these situations require an update to the metadata 
structure associated with the file. 

[0020] Another situation that results in a change to the 
metadata structure of the file is when an input/output request 
on the file violates the alignment or length restrictions of 
direct input/output. That is, the use of concurrent input/ 
output preferably makes certain alignment and length 
restrictions that are to be adhered to by the application's I/O
requests. By creating file systems with an appropriate block 
size, e.g., by specifying an aggregate block size equal to 512 
kb at file system creation, such applications can benefit from 
the use of concurrent I/O without any modifications to the

[0021] If the write access request does not require or result 
in a change in the allocation of data blocks of the file, then
the process acquires a read lock of the file and performs its 
write operations using the read lock. It should be noted that 
the read lock does not prevent write operations from being 
performed on the file. Since multiple processes may acquire 
the read lock on the file at approximately the same time, 
there may be multiple concurrent readers and writers to the 
file at approximately the same time as long as the writers are 
not changing the allocation of the file. 
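
The lock selection described in paragraphs [0016] through [0021] reduces to a small decision function, sketched below for illustration. The function name required_lock and its boolean parameters are hypothetical; the logic simply restates the rules given in the text.

```python
def required_lock(is_read, concurrent_writer_flag, changes_allocation):
    """Return which lock ("read" or "write") an access request must acquire."""
    if is_read:
        return "read"       # any number of readers may share the read lock
    if not concurrent_writer_flag:
        return "write"      # file does not permit multiple concurrent writers
    if changes_allocation:
        return "write"      # extending, truncating, or otherwise reallocating
    return "read"           # write in place under the shared read lock
```

Only the last case departs from conventional locking: a write that leaves the allocation of the file unchanged proceeds under the read lock, so it can run alongside other readers and writers.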
[0022] Because the present invention is intended to be 
used in conjunction with applications that have their own 
serialization of changes to data blocks, e.g., a database 
application, the permitting of multiple writer processes does
not degrade the integrity of the file structure. That is, the 
present invention removes the requirement that the file 
system ensure integrity by always permitting only one writer 
process at a time and allows the application to use its 
serialization mechanisms to govern how changes to blocks
of data are to be committed. Only when actual changes to
allocations are being made does the file system of the present
invention limit changes to allocations to only one writer
process at a time. 

[0023] These and other features and advantages of the 
present invention will be described in, or will become 
apparent to those of ordinary skill in the art in view of, the 
following detailed description of the preferred embodi- 

BRIEF DESCRIPTION OF THE DRAWINGS 
[0024] The novel features believed characteristic of the 
invention are set forth in the appended claims. The invention 
itself, however, as well as a preferred mode of use, further
objectives and advantages thereof, will best be understood
by reference to the following detailed description of an 
illustrative embodiment when read in conjunction with the 
accompanying drawings, wherein: 

[0025] FIG. 1 is an exemplary diagram of a distributed 
data processing system in accordance with the present 

[0026] FIG. 2 is an exemplary diagram of a server com- 
puting device in which the present invention may be imple- 
mented; 

[0027] FIG. 3 is an exemplary diagram of a client com-
puting device in which the present invention may be imple- 
mented;

[0028] FIG. 4A is an exemplary diagram illustrating the
acquiring of locks with regard to a write access request that 
requires a change in allocation of data blocks for a file in 
accordance with the present invention; 

[0029] FIG. 4B is an exemplary diagram illustrating the 
acquiring of locks with regard to a write access request that 
does not change the allocation of data blocks for a file in 
accordance with the present invention; and 

[0030] FIG. 5 is a flowchart outlining an exemplary 
operation of the present invention. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

[0031] The present invention provides a method and appa- 
ratus for allowing multiple concurrent writer processes to 
the same file. The present invention may be implemented in 
a stand alone computing device or in a distributed data 
processing system. For example, the present invention may 
be implemented by a server computing device, a client 
computing device, a stand alone computing device, or a 
combination of a server computing device and a client 
computing device. Therefore, a brief description of a dis- 
tributed data processing system and stand alone computing 
device are described hereafter in order to provide a context 
for the operations of the present invention described there- 

[0032] With reference now to the figures, FIG. 1 depicts 
a pictorial representation of a network of data processing 
systems in which the present invention may be imple- 
mented. Network data processing system 100 is a network of 
computers in which the present invention may be imple- 
mented. Network data processing system 100 contains a 
network 102, which is the medium used to provide commu- 
nications links between various devices and computers 
connected together within network data processing system 
100. Network 102 may include connections, such as wire, 
wireless communication links, or fiber optic cables.

[0033] In the depicted example, server 104 is connected to 
network 102 along with storage unit 106. In addition, clients 
108, 110, and 112 are connected to network 102. These 
clients 108, 110, and 112 may be, for example, personal
computers or network computers. In the depicted example, 
server 104 provides data, such as boot files, operating 
system images, and applications to clients 108-112. Clients 
108, 110, and 112 are clients to server 104. Network data 
processing system 100 may include additional servers, clients, 
and other devices not shown. In the depicted example, 
network data processing system 100 is the Internet with 
network 102 representing a worldwide collection of net- 
works and gateways that use the Transmission Control 
Protocol/Internet Protocol (TCP/IP) suite of protocols to 
communicate with one another. At the heart of the Internet 
is a backbone of high-speed data communication lines 
between major nodes or host computers, consisting of thou- 
sands of commercial, government, educational and other 
computer systems that route data and messages. Of course, 
network data processing system 100 also may be imple- 
mented as a number of different types of networks, such as 
for example, an intranet, a local area network (LAN), or a 
wide area network (WAN). FIG. 1 is intended as an 
example, and not as an architectural limitation for the 
present invention. 



[0034] Referring to FIG. 2, a block diagram of a data 
processing system that may be implemented as a server, such 
as server 104 in FIG. 1, is depicted in accordance with a 
preferred embodiment of the present invention. Data pro- 
cessing system 200 may be a symmetric multiprocessor 
(SMP) system including a plurality of processors 202 and 
204 connected to system bus 206. Alternatively, a single 
processor system may be employed. Also connected to 
system bus 206 is memory controller/cache 208, which 
provides an interface to local memory 209. I/O bus bridge 
210 is connected to system bus 206 and provides an interface 
to I/O bus 212. Memory controller/cache 208 and I/O bus 
bridge 210 may be integrated as depicted. 

[0035] Peripheral component interconnect (PCI) bus 
bridge 214 connected to I/O bus 212 provides an interface to 
PCI local bus 216. A number of modems may be connected 
to PCI local bus 216. Typical PCI bus implementations will 
support four PCI expansion slots or add-in connectors. 
Communications links to clients 108-112 in FIG. 1 may be 
provided through modem 218 and network adapter 220 
connected to PCI local bus 216 through add-in boards. 

[0036] Additional PCI bus bridges 222 and 224 provide 
interfaces for additional PCI local buses 226 and 228, from 
which additional modems or network adapters may be 
supported. In this manner, data processing system 200 
allows connections to multiple network computers. A 
memory-mapped graphics adapter 230 and hard disk 232 
may also be connected to I/O bus 212 as depicted, either 
directly or indirectly. 

[0037] Those of ordinary skill in the art will appreciate 
that the hardware depicted in FIG. 2 may vary. For example, 
other peripheral devices, such as optical disk drives and the 
like, also may be used in addition to or in place of the 
hardware depicted. The depicted example is not meant to 
imply architectural limitations with respect to the present 
invention. 

[0038] The data processing system depicted in FIG. 2 may 
be, for example, an IBM eServer pSeries system, a product 
of International Business Machines Corporation in Armonk, 
N.Y., running the Advanced Interactive Executive (AIX) 
operating system or LINUX operating system. 

[0039] With reference now to FIG. 3, a block diagram 
illustrating a data processing system is depicted in which the 
present invention may be implemented. Data processing 
system 300 is an example of a client computer or a stand 
alone computing device. Data processing system 300 
employs a peripheral component interconnect (PCI) local 
bus architecture. Although the depicted example employs a 
PCI bus, other bus architectures such as Accelerated Graph- 
ics Port (AGP) and Industry Standard Architecture (ISA) 
may be used. Processor 302 and main memory 304 are 
connected to PCI local bus 306 through PCI bridge 308. PCI 
bridge 308 also may include an integrated memory control- 
ler and cache memory for processor 302. Additional con- 
nections to PCI local bus 306 may be made through direct 
component interconnection or through add-in boards. In the 
depicted example, local area network (LAN) adapter 310, 
SCSI host bus adapter 312, and expansion bus interface 314 
are connected to PCI local bus 306 by direct component 
connection. In contrast, audio adapter 316, graphics adapter 
318, and audio/video adapter 319 are connected to PCI local 
bus 306 by add-in boards inserted into expansion slots. 
Expansion bus interface 314 provides a connection for a 
keyboard and mouse adapter 320, modem 322, and addi- 
tional memory 324. Small computer system interface (SCSI) 
host bus adapter 312 provides a connection for hard disk 
drive 326, tape drive 328, and CD-ROM drive 330. Typical 
PCI local bus implementations will support three or four PCI 
expansion slots or add-in connectors. 

[0040] An operating system runs on processor 302 and is 
used to coordinate and provide control of various compo- 
nents within data processing system 300 in FIG. 3. The 
operating system may be a commercially available operating 
system, such as Windows XP, which is available from 
Microsoft Corporation. An object oriented programming 
system such as Java may run in conjunction with the 
operating system and provide calls to the operating system 
from Java programs or applications executing on data pro- 
cessing system 300. "Java" is a trademark of Sun Micro- 
systems, Inc. Instructions for the operating system, the 
object-oriented programming system, and applications or programs 
are located on storage devices, such as hard disk drive 
326, and may be loaded into main memory 304 for execution 
by processor 302. 

[0041] Those of ordinary skill in the art will appreciate 
that the hardware in FIG. 3 may vary depending on the 
implementation. Other internal hardware or peripheral 
devices, such as flash read-only memory (ROM), equivalent 
nonvolatile memory, or optical disk drives and the like, may 
be used in addition to or in place of the hardware depicted 
in FIG. 3. Also, the processes of the present invention may 
be applied to a multiprocessor data processing system. 

[0042] As another example, data processing system 300 
may be a stand-alone system configured to be bootable 
without relying on some type of network communication 
interface. As a further example, data processing system 300 
may be a personal digital assistant (PDA) device, which is 
configured with ROM and/or flash ROM in order to provide 
non-volatile memory for storing operating system files and/ 
or user-generated data. 

[0043] The depicted example in FIG. 3 and above-de- 
scribed examples are not meant to imply architectural limi- 
tations. For example, data processing system 300 also may 
be a notebook computer or hand held computer in addition 
to taking the form of a PDA. Data processing system 300 
also may be a kiosk or a Web appliance. 

[0044] As previously mentioned, the present invention 
provides a method and apparatus for allowing multiple 
concurrent writer processes to access the same file at 
approximately the same time. The present invention is 
preferably implemented in a computing system that employs 
an application that has its own serialization mechanisms for 
ensuring the integrity of changes to files. In a preferred 
embodiment, this application may be a database application 
such as Oracle and DB2. However, any database application 
that enforces its own serialization for accesses to shared 
files can use concurrent I/O, in accordance with the present 
invention, to reduce CPU consumption and eliminate the 
overhead of copying data twice, i.e. first between the disk 
and the file buffer cache, and then from the file buffer cache 
to the application's buffer. 

[0045] The present invention is predicated on the determination 
that the limits on concurrent write operations enforced by file 
systems, such that only one write operation may be performed at 
a time on a file, are rooted in the desire to prevent two or 
more processes from changing the allocation 
of data blocks in the file and thereby corrupting the file 
structure. Other software mechanisms exist, such as in 
database applications, for ensuring consistency of the actual 
data written to the file data blocks, e.g., the two-phase 
commit. Therefore, the present invention seeks to remove 
the limitations of existing file systems with regard to write 
operations that do not change the allocation of data blocks 
in a file such that multiple concurrent write operations may 
be performed with the other software application integrity 
mechanisms governing how these changes to the file are to 
be implemented. 

[0046] With the present invention, write operations that do 
not require or result in a change to the allocation of data 
blocks associated with a file may take a reader lock rather 
than the writer lock. As a result, multiple concurrent write 
operations may be performed by processes as long as those 
write operations do not change the allocation of the block of 
data. If, however, a write operation changes the allocation of 
a block of data, then the write operation must obtain the 
writer lock before the operation may be performed. Since 
only one process may obtain the writer lock at a time, this 
forces serialization of write operations that change the 
allocation of data blocks in a file. That is, each write 
operation that changes an allocation must wait until the 
writer lock is released by a process that currently is changing 
the allocation of data blocks in the file before it can perform 
its operations. The present invention does not avoid or 
bypass the file locking, but makes use of the file locks to 
permit multiple concurrent readers and writers. 
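
The locking rule of paragraph [0046] can be sketched as follows. This is a minimal illustration only, not the implementation described in this application: the class ReadWriteLock and the helper lock_for_write are hypothetical names, and the sketch blocks on a condition variable where the description speaks of spinning on the lock.

```python
import threading


class ReadWriteLock:
    """Minimal shared/exclusive lock (illustrative; no fairness guarantees)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # current holders of the shared (reader) lock
        self._writer = False   # True while the exclusive (writer) lock is held

    def acquire_read(self):
        with self._cond:
            while self._writer:            # wait while the writer lock is taken
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # Exclusive: wait until no reader and no other writer holds the lock.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def lock_for_write(lock, changes_allocation):
    """Take the writer lock only if the write changes block allocation.

    Returns the matching release function for the lock that was taken.
    """
    if changes_allocation:
        lock.acquire_write()
        return lock.release_write
    lock.acquire_read()
    return lock.release_read
```

Under this sketch, two writes that leave the allocation unchanged hold the reader lock at the same time, while a write that changes the allocation waits until it is the sole holder.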

[0047] FIG. 4A is an exemplary diagram illustrating the 
acquiring of locks with regard to a write access request that 
requires a change in allocation of data blocks for a file in 
accordance with the present invention. As shown in FIG. 
4A, a file 400 has associated metadata 410 that includes a 
concurrent writer flag 415, a read lock 420 and a write lock 
430. The concurrent writer flag 415 may be set by an 
application that initially creates the file 400 to indicate 
whether that application permits concurrent writers to the 
file 400. With the present invention, only applications that 
have their own internal serialization or integrity manage- 
ment mechanisms may set the concurrent writer flag 415 
such that the file 400 may be accessed by multiple concur- 
rent writers, i.e. processes that are requesting write access to 
the file 400. An example of such an application is a database 
application which includes its own serialization mechanisms 
for serializing the concurrent writes to data blocks in order 
to maintain the integrity of the file structure. 

[0048] In order for a process to access the file 400, the 
process must obtain a lock on the file 400. If the process 
wishes to read data from the file 400, the process may obtain 
a read lock 420 associated with the file 400. If the process 
wishes to write data to the file 400, the process may have to 
obtain either the read lock 420 or the write lock 430 
depending on the type of write operation being performed. 

[0049] If the write operation that is being performed by a 
process is one that requires or results in a change in the 
allocation of data blocks to the file 400, then the process 
requesting access to the file 400 must obtain the write lock 
430. The access policy associated with the metadata precludes 
more than one process from acquiring the write lock 
430 at any one time. Thus, if two processes are attempting 
to write the file 400, and both processes' write operations 
require or result in a change to the allocation of data blocks 
in the file 400, then only one of these processes will be 
allowed to proceed by obtaining the write lock 430 while the 
other must spin on the lock. It should also be noted that 
readers must also spin while the writer lock is taken and the 
write lock cannot be taken while there is a reader lock. 
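
Spinning on a lock, as mentioned in paragraph [0049], can be illustrated with a non-blocking acquire attempted in a loop. This is a sketch under stated assumptions: spin_acquire and its pause parameter are hypothetical, and a plain mutual-exclusion lock stands in for the write lock 430.

```python
import threading
import time


def spin_acquire(lock, pause=0.0):
    """Retry a non-blocking acquire until it succeeds.

    Returns the number of failed attempts made before the lock was obtained.
    """
    spins = 0
    while not lock.acquire(blocking=False):
        spins += 1
        time.sleep(pause)   # hypothetical back-off between attempts
    return spins
```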

[0050] Thus, as shown in FIG. 4A, process 1 440 and 
process 2 450 send read access requests to the file system 
requesting access to the file 400 so that they may read data 
from the file 400. As a result, each of process 1 440 and 
process 2 450 obtains the read lock 420 associated with the 
file 400. Process 3 460, however, sends a write access request 
to the file system requesting access to the file 400 so that the 
process 460 may write data to the file 400. This writing of 
data is determined to require or result in a change in the 
allocation of data blocks within file 400. 

[0051] As previously mentioned, one situation in which a 
write access request will change the allocation of the data 
blocks of the file is when a file is extended, i.e. the request 
is a request to write to an offset that is greater than the 
current file size. Another situation where a write access 
request will change the allocation of the data blocks is when 
the file is truncated. Both of these situations require an 
update to the metadata structure associated with the file. 
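
The two situations of paragraph [0051] can be expressed as a small predicate. This is a sketch only: the function name and parameters are hypothetical, and the final check (a write that begins inside the file but runs past its end) is an assumption that goes beyond the two cases named in the text.

```python
def write_changes_allocation(offset, length, file_size, is_truncate=False):
    """Return True if the request would change the file's block allocation."""
    if is_truncate:
        return True                      # truncation changes the allocation
    if offset > file_size:
        return True                      # write beyond the current end of file
    # Assumption: a write starting inside the file that runs past the end
    # also extends the file and therefore changes the allocation.
    return offset + length > file_size
```

For example, overwriting the first 4096 bytes of an 8192-byte file leaves the allocation unchanged, whereas a write at offset 8193 does not.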

[0052] Another situation that results in a change to the 
metadata structure of the file is when an input/output request 
on the file violates the alignment or length restrictions of 
direct input/output. That is, the use of concurrent input/output 
preferably imposes certain alignment and length 
restrictions that are to be adhered to by the application's I/O 
requests. By creating file systems with an appropriate block 
size, e.g., by specifying an aggregate block size equal to 512 
kb at file system creation, such applications can benefit from 
the use of concurrent I/O without any modifications to the 
applications. 
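
A check of the kind described in paragraph [0052] might look like the following. The text does not spell out the exact restrictions, so this sketch assumes that both the offset and the length of a request must be whole multiples of the file system block size; the function name and parameters are hypothetical.

```python
def violates_direct_io_restrictions(offset, length, block_size):
    """Return True if the request is not block-aligned in offset or length.

    Assumption: alignment means being a whole multiple of block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return offset % block_size != 0 or length % block_size != 0
```

A request that fails this check would be treated like a write that changes the allocation, that is, it would need the write lock rather than the read lock.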

[0053] As a result of determining that process 3 460 
requires a change in the allocation of data blocks within the file 
400, the process 460 must obtain the write lock 430 in order 
to perform its write operations to data blocks of the file 400. 
If the process 460 is unable to acquire the write lock 430 
immediately, the process 460 may spin on the write lock 430 
until it is released by the process that currently has the write 
lock 430. 

[0054] With the present invention, if the write operation of 
a process will not require or result in a change in the 
allocation of the data blocks in the file 400, then the process 
may obtain the read lock 420 rather than being forced to 
obtain the write lock 430. That is, the present invention 
differentiates between two different types of write accesses, 
a write that will change the allocation of data blocks in the 
file 400 and a write that will not change the allocation of data 
blocks in the file 400. 

[0055] FIG. 4B is an exemplary diagram illustrating the 
acquiring of locks with regard to a write access request that 
does not change the allocation of data blocks for a file in 
accordance with the present invention. As illustrated in FIG. 
4B, the processes 440 and 450 send read access requests to 
the file system requesting access to the file 400 to read data 
from the file 400. These processes acquire the read lock 420 
and are able to concurrently perform read operations on the 
data in the file 400. 

[0056] The processes 460 and 470 submit write access 
requests to the file system requesting access to the file 400 
to write data to the file 400. The write operations that 
processes 460 and 470 are intending to perform are deter- 
mined to be of a type that does not require or result in a 
change to the allocation of data blocks in file 400. Since the 
write operations do not change the allocation of data blocks 
in the file 400, the processes 460 and 470 are permitted to 
acquire the read lock 420 and thus, are able to concurrently 
write data to the file 400. Software based mechanisms, such 
as database application serialization mechanisms, are utilized 
to determine how the concurrent write operations are 
to be serialized such that file structure integrity is main- 
tained. 
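
The situation of FIG. 4B, in which several writers overwrite existing data blocks without changing the allocation, can be illustrated with a toy example. This is a sketch under stated assumptions: a fixed-size in-memory buffer stands in for a file whose blocks are already allocated, each thread plays the role of a writer process, and BLOCK_SIZE and the function names are hypothetical.

```python
import threading

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for readability


def overwrite_block(buf, index, payload):
    """Overwrite one already-allocated block in place; the buffer never grows."""
    if len(payload) != BLOCK_SIZE:
        raise ValueError("payload must be exactly one block")
    start = index * BLOCK_SIZE
    buf[start:start + BLOCK_SIZE] = payload  # same-length slice assignment


def run_concurrent_writers():
    # Stand-in for a file whose four data blocks are already allocated.
    buf = bytearray(b"." * (BLOCK_SIZE * 4))
    payloads = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    writers = [
        threading.Thread(target=overwrite_block, args=(buf, i, p))
        for i, p in enumerate(payloads)
    ]
    for w in writers:
        w.start()
    for w in writers:
        w.join()
    return bytes(buf)
```

Because each writer touches a distinct block and the buffer length never changes, the writers need no mutual exclusion here; deciding which writer owns which block corresponds to the application-level serialization that the text leaves to the database application.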

[0057] Thus, the present invention provides a mechanism 
for eliminating the bottleneck to performance found in the 
access policy of conventional file systems with regard to 
permitting only a single writer to a file at any one time. With 
the present invention, this limitation is lifted with regard to 
write operations that do not require or result in a change in 
the allocation of data blocks in the file. As a result, multiple 
concurrent write operations may be performed without sacrificing 
the file structure integrity. Existing software based 
serialization and locking mechanisms associated with an 
application present on the computing system are utilized to 
govern how these concurrent write operations are to be 
reflected in the file structure such that the integrity of the file 
structure is maintained. 

[0058] FIG. 5 is a flowchart outlining an exemplary 
operation of the present invention. It will be understood that 
each block of the flowchart illustration, and combinations of 
blocks in the flowchart illustration, can be implemented by 
computer program instructions. These computer program 
instructions may be provided to a processor or other pro- 
grammable data processing apparatus to produce a machine, 
such that the instructions which execute on the processor or 
other programmable data processing apparatus create means 
for implementing the functions specified in the flowchart 
block or blocks. These computer program instructions may 
also be stored in a computer-readable memory or storage 
medium that can direct a processor or other programmable 
data processing apparatus to function in a particular manner, 
such that the instructions stored in the computer-readable 
memory or storage medium produce an article of manufac- 
ture including instruction means which implement the func- 
tions specified in the flowchart block or blocks. 

[0059] Accordingly, blocks of the flowchart illustration 
support combinations of means for performing the specified 
functions, combinations of steps for performing the speci- 
fied functions and program instruction means for performing 
the specified functions. It will also be understood that each 
block of the flowchart illustration, and combinations of 
blocks in the flowchart illustration, can be implemented by 
special purpose hardware-based computer systems which 
perform the specified functions or steps, or by combinations 
of special purpose hardware and computer instructions. 

[0060] As shown in FIG. 5, the operation starts by receiv- 
ing a request for access to a file (step 510). A determination 
is made as to whether this access request is a read access 
request (step 520). If so, the reader lock is taken (step 560). 
If the request is not a read request then it is determined that 
the request is a write access request. 

[0061] If the access request is not a read access request, a 
determination is made as to whether the file to which access 
is requested allows concurrent readers and writers (step 
530). As mentioned above, this may involve determining the 
value of a concurrent writer flag in the metadata of the file, 
for example. If the file does not permit concurrent writers, 
the writer lock is taken (step 540). This assumes that the 
writer lock is available and has not been acquired by another 
process. If the writer lock is already acquired by another 
process, the current process may spin on the lock until it is 
released so that the current process may acquire it. As 
mentioned above, only one process may acquire the writer 
lock at any one time and thus, no other processes that are 
attempting to perform a write to the file will be able to 
perform their operation until after the writer lock is released. 

[0062] If the file does allow multiple concurrent writers, 
then a determination is made as to whether the write request 
is one that will require or result in a change in the allocation 
of data blocks in the file (step 550). If so, the writer lock is 
acquired (step 540) as discussed above. Otherwise, if the 
write request is one that will not require or result in a change 
in the allocation of data blocks in the file, then a reader lock 
may be acquired by the process submitting the write request 
(step 560). As previously mentioned, multiple processes 
may acquire the reader lock on the file and thereby access 
the file concurrently. With the present invention, since write 
requests that do not change the allocation of data blocks of 
a file may acquire this lock, multiple concurrent writers to 
the file are possible. The present invention allows the 
serialization mechanisms of the applications of the computing 
device, e.g., the database application, to govern how 
changes to the file are to be committed. Thus, the file system 
of the present invention only limits processes from writing 
to a file concurrently when the write operations would result 
in a change in the allocation of data blocks of the file. 
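
The decision flow of FIG. 5 (steps 510 through 560) can be summarized in a few lines. This is a minimal sketch, not the claimed implementation: the AccessRequest fields, the Lock enumeration, and select_lock are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Lock(Enum):
    READER = "reader"
    WRITER = "writer"


@dataclass
class AccessRequest:
    is_read: bool
    file_allows_concurrent_writers: bool = False  # e.g. the concurrent writer flag
    changes_allocation: bool = False


def select_lock(req):
    """Choose which lock the requesting process must take (step 510 onward)."""
    if req.is_read:                              # step 520
        return Lock.READER                       # step 560
    if not req.file_allows_concurrent_writers:   # step 530
        return Lock.WRITER                       # step 540
    if req.changes_allocation:                   # step 550
        return Lock.WRITER                       # step 540
    return Lock.READER                           # step 560
```

Only the last branch differs from a conventional file system: a write to a file that permits concurrent writers, and that leaves the allocation unchanged, takes the reader lock.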

[0063] It is important to note that while the present inven- 
tion has been described in the context of a fully functioning 
data processing system, those of ordinary skill in the art will 
appreciate that the processes of the present invention are 
capable of being distributed in the form of a computer 
readable medium of instructions and a variety of forms and 
that the present invention applies equally regardless of the 
particular type of signal bearing media actually used to carry 
out the distribution. Examples of computer readable media 
include recordable-type media, such as a floppy disk, a hard 
disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmis- 
sion-type media, such as digital and analog communications 
links, wired or wireless communications links using trans- 
mission forms, such as, for example, radio frequency and 
light wave transmissions. The computer readable media may 
take the form of coded formats that are decoded for actual 
use in a particular data processing system. 

[0064] The description of the present invention has been 
presented for purposes of illustration and description, and is 
not intended to be exhaustive or limited to the invention in 
the form disclosed. Many modifications and variations will 
be apparent to those of ordinary skill in the art. The 
embodiment was chosen and described in order to best 
explain the principles of the invention, the practical application, 
and to enable others of ordinary skill in the art to 
understand the invention for various embodiments with 
various modifications as are suited to the particular use 
contemplated. 

What is claimed is: 

1. A method of providing write access to a file, compris- 
ing: 

receiving a write access request from a process for write 
access to the file; 

determining if a write operation associated with the write 
access request results in a change to an allocation of 
data blocks in the file; and 

permitting the process to obtain a read lock associated 
with the file to perform the write operation if the write 
operation does not result in a change to the allocation 
of data blocks in the file. 

2. The method of claim 1, further comprising: 

requiring that the process obtain a write lock associated 
with the file to perform the write operation if the write 
operation results in a change to the allocation of data 
blocks in the file. 

3. The method of claim 1, wherein multiple processes may 
have concurrent access to the file by obtaining a read lock 
associated with the file. 

4. The method of claim 2, wherein only one process may 
obtain the write lock at a time. 

5. The method of claim 1, wherein the process performs 
the write operation to the file concurrently with another 
write operation to the file from another process. 

6. The method of claim 1, wherein determining if the 
write operation results in a change to an allocation of data 
blocks in the file includes determining if the write operation 
is to an offset that is greater than a current file size. 

7. The method of claim 1, wherein determining if the 
write operation results in a change to an allocation of data 
blocks in the file includes determining if the write operation 
is to truncate the file. 

8. A computer program product in a computer readable 
medium for providing write access to a file, comprising: 

first instructions for receiving a write access request from 
a process for write access to the file; 

second instructions for determining if a write operation 
associated with the write access request results in a 
change to an allocation of data blocks in the file; and 

third instructions for permitting the process to obtain a 
read lock associated with the file to perform the write 
operation if the write operation does not result in a 
change to the allocation of data blocks in the file. 

9. The computer program product of claim 8, further 
comprising: 

fourth instructions for requiring that the process obtain a 
write lock associated with the file to perform the write 
operation if the write operation results in a change to 
the allocation of data blocks in the file. 

10. The computer program product of claim 8, wherein 
multiple processes may have concurrent access to the file by 
obtaining a read lock associated with the file. 

11. The computer program product of claim 9, wherein 
only one process may obtain the write lock at a time. 
12. The computer program product of claim 8, wherein 
the process performs the write operation to the file concur- 
rently with another write operation to the file from another 
process. 

13. The computer program product of claim 8, wherein 
the second instructions for determining if the write operation 
results in a change to an allocation of data blocks in the file 
include instructions for determining if the write operation is 
to an offset that is greater than a current file size. 

14. The computer program product of claim 8, wherein 
the second instructions for determining if the write operation 
results in a change to an allocation of data blocks in the file 
include instructions for determining if the write operation is 
to truncate the file. 

15. An apparatus for providing write access to a file, 
comprising: 

means for receiving a write access request from a process 
for write access to the file; 

means for determining if a write operation associated with 
the write access request results in a change to an 
allocation of data blocks in the file; and 

means for permitting the process to obtain a read lock 
associated with the file to perform the write operation 
if the write operation does not result in a change to the 
allocation of data blocks in the file. 



16. The apparatus of claim 15, further comprising: 

means for requiring that the process obtain a write lock 
associated with the file to perform the write operation 
if the write operation results in a change to the alloca- 
tion of data blocks in the file. 

17. The apparatus of claim 15, wherein multiple processes 
may have concurrent access to the file by obtaining a read 
lock associated with the file. 

18. The apparatus of claim 16, wherein only one process 
may obtain the write lock at a time. 

19. The apparatus of claim 15, wherein the process 
performs the write operation to the file concurrently with 
another write operation to the file from another process. 

20. The apparatus of claim 15, wherein the means for 
determining if the write operation results in a change to an 
allocation of data blocks in the file includes means for 
determining if the write operation is to an offset that is 
greater than a current file size. 

21. The apparatus of claim 15, wherein the means for 
determining if the write operation results in a change to an 
allocation of data blocks in the file includes means for 
determining if the write operation is to truncate the file. 
[57] ABSTRACT 
A file server architecture is disclosed, comprising as 
separate processors, a network controller unit, a file 
controller unit and a storage processor unit. These units 
incorporate their own processors, and operate in paral- 
lel with a local Unix host processor. All networks are 
connected to the network controller unit, which per- 
forms all protocol processing up through the NFS 
layer. The virtual file system is implemented in the file 
control unit, and the storage processor provides high- 
speed multiplexed access to an array of mass storage 
devices. The file controller unit controls file information 
caching through its own local cache buffer, and con- 
trols disk data caching through a large system memory 
which is accessible on a bus by any of the processors. 
PARALLEL I/O NETWORK FILE SERVER 
ARCHITECTURE 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

The present application is related to the following 
U.S. patent applications, all filed concurrently herewith: 

1. MULTIPLE FACILITY OPERATING SYSTEM 
ARCHITECTURE, invented by David Hitz, 
Allan Schwartz, James Lau and Guy Harris; 

2. ENHANCED VMEBUS PROTOCOL UTILIZING 
PSEUDOSYNCHRONOUS HANDSHAKING 
AND BLOCK MODE DATA TRANSFER, invented 
by Daryl Starr; and 

3. BUS LOCKING FIFO MULTI-PROCESSOR 
COMMUNICATIONS SYSTEM UTILIZING 
HANDSHAKING 
AND BLOCK MODE DATA TRANSFER invented 
by Daryl D. Starr, William Pitts and Stephen Blightman. 

The above applications are all assigned to the assignee [...] 
expressly incorporated herein by reference. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The invention relates to computer data networks, and 
more particularly, to network file server architectures 
for computer networks. 

2. Description of the Related Art 

Over the past ten years, remarkable increases in hardware 
price/performance ratios have caused a startling 
shift in both technical and office computing environments. 
Distributed workstation-server networks are 
displacing the once pervasive dumb terminal attached 
to mainframe or minicomputer. To date, however, network 
I/O limitations have constrained the potential 
performance available to workstation users. This situation 
has developed in part because dramatic jumps in 
microprocessor performance have exceeded increases 
in network I/O performance. 

In a computer network, individual user workstations 
are referred to as clients, and shared resources for filing, 
printing, data storage and wide-area communications 
are referred to as servers. Clients and servers are all 
considered nodes of a network. Client nodes use standard 
communications protocols to exchange service 
requests and responses with server nodes. 

Present-day network clients and servers usually run 
the DOS, Macintosh OS, OS/2, or Unix operating systems. 
Local networks are usually Ethernet or Token 
Ring at the high end, Arcnet in the midrange, or LocalTalk 
or StarLAN at the low end. The client-server 
communication protocols are fairly strictly dictated by [...] 

[...] machines include RISC-based DEC, HP, and Sun Unix 
workstations. Servers are typically nothing more than 
repackaged client nodes, configured in 19-inch racks 
rather than desk sideboxes. The extra space of a 19-inch 
rack is used for additional backplane slots, disk or tape 
drives, and power supplies. 

Driven by RISC and CISC microprocessor developments, 
client workstation performance has increased by 
more than a factor of ten in the last few years. Concurrently, 
these extremely fast clients have also gained an 
appetite for data that remote servers are unable to satisfy. 
Because the I/O shortfall is most dramatic in the 
Unix environment, the description of the preferred embodiment 
of the present invention will focus on Unix 
file servers. The architectural principles that solve the 
Unix server I/O problem, however, extend easily to 
server performance bottlenecks in other operating system 
environments as well. Similarly, the description of 
the preferred embodiment will focus on Ethernet implementations, 
though the principles extend easily to other 
types of networks. 

In most Unix environments, clients and servers exchange 
file data using the Network File System 
("NFS"), a standard promulgated by Sun Microsystems 
[...] community. NFS 
is defined in a document entitled, "NFS: Network File 
System Protocol Specification," Request For Comments 
[...] 1989). This document is incorporated herein by reference 
[...] 

[...] reliable, NFS is not optimal. Clients 
[...] place considerable demands upon both networks 
and [...] servers supplying clients with NFS 
data. This demand is particularly acute for so-called 
diskless clients that have no local disks and therefore 
depend on a file server for application binaries and 
virtual memory paging as well as data. For these Unix 
client-server configurations, the ten-to-one increase in 
client power has not been matched by a ten-to-one 
increase in Ethernet capacity, in disk speed, or server 
disk-to-network I/O throughput. 

The result is that the number of diskless clients that a 
single modern high-end server can adequately support 
has dropped to between 5-10, depending on client 
power and application workload. For clients containing 
small local disks for applications and paging, referred to 
as dataless clients, the client-to-server ratio is about 
twice this, or between 10-20. 

Such low client/server ratios cause piecewise network 
configurations in which each local Ethernet contains 
isolated traffic for its own 5-10 (diskless) clients 
and dedicated server. For overall connectivity, these 
local networks are usually joined together with an 
Ethernet backbone or, in the future, with an FDDI 
backbone. These backbones are typically connected to 
the operating system environment— usually one of sev- the local networks either by IP routers or MAC-level 
eral proprietary schemes for PCs (NetWare, 3Plus, bridges, coupling the local networks together directly, 
Vines, LANManager, LANServer); AppleTalk for or by a second server functioning as a network inter- 
Maclntoshes; and TCP/IP with NFS or RFS for Unix. 60 face, coupling servers for all the local networks to- 
These protocols are all well-known in the industry. gether. 

Unix client nodes typically feature a 16- or 32-bit In addition to performance considerations, the low 
microprocessor with 1-8 MB of primary memory, a client-to-server ratio creates computing problems in 
640x1024 pixel display, and a built-in network inter- several additional ways: 
face. A 40-100 MB local disk is often optional. Low-end 65 1. Sharing 

examples are 80286-based PCs or 68000-based Maoln- Development groups of more than SO-people cannot 
tosh I's; mid-range machines include 80386 PCs, Macin- share the same server, and thus cannot easily share files 
tosh II's, and 680X0-based Unix workstations; high-end without file replication and manual, multi-server up- 
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dates. Bridges or routers are a partial solution but inflict access to the system memory, and one or more of the 

a performance penalty due to more network hops. CPUs can be connected to a network. This architecture 

2. Administration is disadvantageous as a file server because, among other 
System administrators must maintain many limited- things, both file data and the instructions for the CPUs 

capacity servers rather than a few more substantial 5 reside in the same system memory. There will be in- 
servers. This burden includes network administration, stances, therefore, in which the CPUs must stop run- 
hardware maintenance, and user account administra- ning while they wait for large blocks of file data to be 
tion transferred between system memory and the network 

3. File System Backup CPU. Additionally, as with both of the previously de- 
System administrators or operators must conduct 10 scribed computer architectures, the entire operating 

multiple file system backups, which can be onerously system runs on each of the CPUs, including the network 

time consuming usks. It is also expensive to duplicate CPU. 

backup peripherals on each server (or every few servers In yet another type of computer architecture, a large 

if slower network backup is used). number of CPUs are connected together in a hypercube 

4. Price Per Seat 15 topology. One of more of these CPUs can be connected 
With only 5-10 clients per server, the cost of the to networks, while another can be connected to disk 

server must be shared by only a small number of users. drives. This architecture is also disadvantageous as a file 
The real cost of an entry-level Unix workstation is server because, among other things, each processor 
therefore significantly greater, often as much as 140% runs the entire operating system. Interprocessor corn- 
greater, than the cost of the workstation alone. 20 inunication is also not optimal for file server applica- 

The widening I/O gap, as well as administrative and tions. 

economic considerations, demonstrates a need for high- SUMMARY OF THE INVENTION 
er-performance, larger-capacity Unix file servers. Con- 

version of a display-less workstation into a server may The present invention involves a new, server-specific 

address disk capacity issues, but does nothing to address 25 I/O architecture that is optimized for a Unix file serv- 

fundamental I/O limitations. As an NFS server, the er's most common actions— file operations. Roughly 

one-time workstation must sustain 5-10 or more times stated, the invention involves a file server architecture 

the network, disk, backplane, and file system throughput comprising one or more network controllers, one or 

than it was designed to support as a client. Adding more file controllers, one or more storage processors, 

larger disks, more network adaptors, extra primary 30 and a system or buffer memory, all connected over a 

memory, or even a faster processor do not resolve basic message passing bus and operating in parallel with the 

architectural I/O constraints; I/O throughput does not Unix host processor. The network controllers each 

increase sufficiently. connect to one or more network, and provide all proto- 

Other prior art computer architectures, while not col processing between the network layer data format 

specifically designed as file servers, may potentially be 35 and an internal file server format for communicating 

used as such. In one such well-known architecture, a client requests to other processors in the server. Only 

CPU, a memory unit, and two I/O processors are con- those data packets which cannot be interpreted by the 

nected to a single bus. One of the I/O processors oper- network controllers, for example client requests to run 

ates a set of disk drives, and if the architecture is to be a client-defined program on the server, are transmitted 

used as a server, the other I/O processor would be 40 to the Unix host for processing. Thus the network con- 

connected to a network. This architecture is not optimal troUers, file controllers and storage processors contain 

as a file server, however, at least because the two I/O only small parts of an overall operating system, and 

processors cannot handle network file requests without each is optimized for the particular type of work to 

involving the CPU. All network file requests that are which it is dedicated. 

received by the network I/O processor are first trans- 45 Client requests for file operations are transmitted to 
mitted to the CPU, which makes appropriate requests to one of the file controllers which, independently of the 
the disk-I/O processor for satisfaction of the network Unix host, manages the virtual file system of a mass 
nquesx. storage device which is coupled to the storage proces- 
In another such computer architecture, a disk con- sors. The file controllers may also control data buffer- 
troller CPU manages access to disk drives, and several 50 ing between the storage processors and the network 
other CPUs, three for example, may be clustered controllers, through the system memory. The file con- 
around the disk controller CPU. Each of the other troUers preferably each include a local buffer memory 
CPUs can be connected to its own network. The net- for caching file control information, separate from the 
work CPUs are each connected to the disk controller system memory for caching file data. Additionally, the 
CPU as well as to each other for interprocessor commu- 55 network controllers, file processors Mid storage proces- 
nication One of the disadvantages of this computer sors are all designed to avoid any instruction fetches 
architecture is that each CPU in the system runs its own from the system memory, instead keeping all instruction 
complete operating system. Thus, network file server memory separate and local. This arrangement elimi- 
requests must be handled by an operating system which nates contention on the backplane between micro- 
is also heavily loaded with facilities and processes for 60 processor instruction fetches and transmissions of mes- 
perforraing a large number of Other, non file-server sage and file data. 

tasks. Additionally, the interprocessor communication brief DESCRIPTION OF THE DRAWINGS 
is not optimized for file server type requests. 

In yet another computer architecture, a plurality of The invention will be described with respect to par- 
CPUs, each having its own cache memory for data and 65 ticular embodiments thereof, and reference will be 

instruction storage, are connected to a common bus made to the drawings, in which: 

with a system memory and a disk controller. The disk FIG. 1 is a block diagram of a prior art file server 

controller and each of the CPUs have direct memory architecture; 
FIG. 2 is a block diagram of a file server architecture according to the invention;

FIG. 3 is a block diagram of one of the network controllers shown in FIG. 2;

FIG. 4 is a block diagram of one of the file controllers shown in FIG. 2;

FIG. 5 is a block diagram of one of the storage processors shown in FIG. 2;

FIG. 6 is a block diagram of one of the system memory cards shown in FIG. 2;

FIGS. 7A-C are a flowchart illustrating the operation of a fast transfer protocol BLOCK WRITE cycle; and

FIGS. 8A-C are a flowchart illustrating the operation of a fast transfer protocol BLOCK READ cycle.

DETAILED DESCRIPTION

For comparison purposes and background, an illustrative prior-art file server architecture will first be described with respect to FIG. 1. FIG. 1 is an overall block diagram of a conventional prior-art Unix-based file server for Ethernet networks. It consists of a host CPU card 10 with a single microprocessor on board. The host CPU card 10 connects to an Ethernet #1 12, and it connects via a memory management unit (MMU) 11 to a large memory array 16. The host CPU card 10 also drives a keyboard, a video display, and two RS232 ports (not shown). It also connects via the MMU 11 and a standard 32-bit VME bus 20 to various peripheral devices, including an SMD disk controller 22 controlling one or two disk drives 24, a SCSI host adaptor 26 connected to a SCSI bus 28, a tape controller 30 connected to a quarter-inch tape drive 32, and possibly a network #2 controller 34 connected to a second Ethernet 36. The SMD disk controller 22 can communicate with memory array 16 by direct memory access via bus 20 and MMU 11, with either the disk controller or the MMU acting as a bus master. This configuration is illustrative; many variations are available.

The system communicates over the Ethernets using industry standard TCP/IP and NFS protocol stacks. A description of protocol stacks in general can be found in Tanenbaum, "Computer Networks" (Second Edition, Prentice Hall: 1988). File server protocol stacks are described at pages 535-546. The Tanenbaum reference is incorporated herein by reference.

Basically, the following protocol layers are implemented in the apparatus of FIG. 1:

Network Layer

The network layer converts data packets between a format specific to Ethernets and a format which is independent of the particular type of network used. The Ethernet-specific format which is used in the apparatus of FIG. 1 is described in Hornig, "A Standard For The Transmission of IP Datagrams Over Ethernet Networks," RFC 894 (April 1984), which is incorporated herein by reference.

The Internet Protocol (IP) Layer

This layer provides the functions necessary to deliver a package of bits (an internet datagram) from a source to a destination over an interconnected system of networks. For messages to be sent from the file server to a client, a higher level in the server calls the IP module, providing the internet address of the destination client and the message to transmit. The IP module performs any required fragmentation of the message to accommodate packet size limitations of any intervening gateway, adds internet headers to each fragment, and calls on the network layer to transmit the resulting internet datagrams. The internet header includes a local network destination address (translated from the internet address) as well as other parameters.

For messages received by the IP layer from the network layer, the IP module determines from the internet address whether the datagram is to be forwarded to another host on another network, for example on a second Ethernet such as 36 in FIG. 1, or whether it is intended for the server itself. If it is intended for another host on the second network, the IP module determines a local net address for the destination and calls on the local network layer for that network to send the datagram. If the datagram is intended for an application program within the server, the IP layer strips off the header and passes the remaining portion of the message to the appropriate next higher layer. The internet protocol used in the illustrative apparatus of FIG. 1 is specified in Information Sciences Institute, "Internet Protocol, DARPA Internet Program Protocol Specification," RFC 791 (September 1981), which is incorporated herein by reference.

TCP/UDP Layer

This layer is a datagram service with more elaborate packaging and addressing options than the IP layer. For example, whereas an IP datagram can hold about 1,500 bytes and be addressed to hosts, UDP datagrams can hold about 64 KB and be addressed to a particular port within a host. TCP and UDP are alternative protocols at this layer; applications requiring ordered reliable delivery of streams of data may use TCP, whereas applications (such as NFS) which do not require ordered and reliable delivery may use UDP.

The prior art file server of FIG. 1 uses both TCP and UDP. It uses UDP for file server-related services, and uses TCP for certain other services which the server provides to network clients. The UDP is specified in Postel, "User Datagram Protocol," RFC 768 (Aug. 28, 1980), which is incorporated herein by reference. TCP is specified in Postel, "Transmission Control Protocol," RFC 761 (January 1980) and RFC 793 (September 1981), which are also incorporated herein by reference.

XDR/RPC Layer

This layer provides functions callable from higher level programs to run a designated procedure on a remote machine. It also provides the decoding necessary to permit a client machine to execute a procedure on the server. For example, a caller process in a client node may send a call message to the server of FIG. 1. The call message includes a specification of the desired procedure, and its parameters. The message is passed up the stack to the RPC layer, which calls the appropriate procedure within the server. When the procedure is complete, a reply message is generated and RPC passes it back down the stack and over the network to the caller client. RPC is described in Sun Microsystems, Inc., "RPC: Remote Procedure Call Protocol Specification, Version 2," RFC 1057 (June 1988), which is incorporated herein by reference.

RPC uses the XDR external data representation standard to represent information passed to and from the underlying UDP layer. XDR is merely a data encoding standard, useful for transferring data between different computer architectures. Thus, on the network side of the XDR/RPC layer, information is machine-independent; on the host application side, it may not be. XDR is described in Sun Microsystems, Inc., "XDR: External
Data Representation Standard," RFC 1014 (June 1987), which is incorporated herein by reference.

NFS Layer

The NFS ("network file system") layer is one of the programs available on the server which an RPC request can call. The combination of host address, program number, and procedure number in an RPC request can specify one remote NFS procedure to be called.

Remote procedure calls to NFS on the file server of FIG. 1 provide transparent, stateless, remote access to shared files on the disks 24. NFS assumes a file system that is hierarchical, with directories as all but the bottom level of files. Client hosts can call any of about 20 NFS procedures including such procedures as reading a specified number of bytes from a specified file; writing a specified number of bytes to a specified file; creating, renaming and removing specified files; parsing directory trees; creating and removing directories; and reading and setting file attributes. The location on disk to which and from which data is stored and retrieved is always specified in logical terms, such as by a file handle or inode designation and a byte offset. The details of the actual data storage are hidden from the client. The NFS procedures, together with possible higher level modules such as Unix VFS and UFS, perform all conversion of logical data addresses to physical data addresses such as drive, head, track and sector identification. NFS is specified in Sun Microsystems, Inc., "NFS: Network File System Protocol Specification," RFC 1094 (March 1989), incorporated herein by reference.

With the possible exception of the network layer, all the protocol processing described above is done in software, by a single processor in the host CPU card 10. That is, when an Ethernet packet arrives on Ethernet 12, the host CPU 10 performs all the protocol processing in the NFS stack, as well as the protocol processing for any other application which may be running on the host 10. NFS procedures are run on the host CPU 10, with access to memory 16 for both data and program code being provided via MMU 11. Logically specified data addresses are converted to a much more physically specified form and communicated to the SMD disk controller 22 or the SCSI bus 28, via the VME bus 20, and all disk caching is done by the host CPU 10 through the memory 16. The host CPU card 10 also runs procedures for performing various other functions of the file server, communicating with tape controller 30 via the VME bus 20. Among these are client-defined remote procedures requested by client workstations.

If the server serves a second Ethernet 36, packets from that Ethernet are transmitted to the host CPU 10 over the same VME bus 20 in the form of IP datagrams. Again, all protocol processing except for the network layer is performed by software processes running on the host CPU 10. In addition, the protocol processing for any message that is to be sent from the server out on either of the Ethernets 12 or 36 is also done by processes running on the host CPU 10.

It can be seen that the host CPU 10 performs an enormous amount of processing of data, especially if 5-10 clients on each of the two Ethernets are making file server requests and need to be sent responses on a frequent basis. The host CPU 10 runs a multitasking Unix operating system, so each incoming request need not wait for the previous request to be completely processed and returned before being processed. Multiple processes are activated on the host CPU 10 for performing different stages of the processing of different requests, so many requests may be in process at the same time. But there is only one CPU on the card 10, so the processing of these requests is not accomplished in a truly parallel manner. The processes are instead merely time sliced. The CPU 10 therefore represents a major bottleneck in the processing of file server requests.

Another bottleneck occurs in MMU 11, which must transmit both instructions and data between the CPU card 10 and the memory 16. All data flowing between the disk drives and the network passes through this interface at least twice.

Yet another bottleneck can occur on the VME bus 20, which must transmit data among the SMD disk controller 22, the SCSI host adaptor 26, the host CPU card 10, and possibly the network #2 controller 34.

PREFERRED EMBODIMENT-OVERALL HARDWARE ARCHITECTURE

In FIG. 2 there is shown a block diagram of a network file server 100 according to the invention. It can include multiple network controller (NC) boards, one or more file controller (FC) boards, one or more storage processor (SP) boards, multiple system memory boards, and one or more host processors. The particular embodiment shown in FIG. 2 includes four network controller boards 110a-110d, two file controller boards 112a-112b, two storage processors 114a-114b, four system memory cards 116a-116d for a total of 192 MB of memory, and one local host processor 118. The boards 110, 112, 114, 116 and 118 are connected together over a VME bus 120 on which an enhanced block transfer mode as described in the ENHANCED VMEBUS PROTOCOL application identified above may be used. Each of the four network controllers 110 shown in FIG. 2 can be connected to up to two Ethernets 122, for a total capacity of 8 Ethernets 122a-122h. Each of the storage processors 114 operates ten parallel SCSI busses, nine of which can each support up to three SCSI disk drives. The tenth SCSI channel on each of the storage processors 114 is used for tape drives and other SCSI peripherals.

The host 118 is essentially a standard SunOS Unix processor, providing all the standard Sun Open Network Computing (ONC) services except NFS and IP routing. Importantly, all network requests to run a user-defined procedure are passed to the host for execution. Each of the NC boards 110, the FC boards 112 and the SP boards 114 includes its own independent 32-bit microprocessor. These boards essentially off-load from the host processor 118 virtually all of the NFS and disk processing. Since the vast majority of messages to and from clients over the Ethernets 122 involve NFS requests and responses, the processing of these requests in parallel by the NC, FC and SP processors, with minimal involvement by the local host 118, vastly improves file server performance. Unix is explicitly eliminated from virtually all network, file, and storage processing.

OVERALL SOFTWARE ORGANIZATION AND DATA FLOW

Prior to a detailed discussion of the hardware subsystems shown in FIG. 2, an overview of the software structure will now be undertaken. The software organization is described in more detail in the above-identified application entitled MULTIPLE FACILITY OPERATING SYSTEM ARCHITECTURE.

Most of the elements of the software are well known in the field and are found in most networked Unix systems, but there are two components which are not: Local NFS ("LNFS") and the messaging kernel ("MK") operating system kernel. These two components will be explained first.

The Messaging Kernel

The various processors in file server 100 communicate with each other through the use of a messaging kernel running on each of the processors 110, 112, 114 and 118. These processors do not share any instruction memory, so task-level communication cannot occur via straightforward procedure calls as it does in conventional Unix. Instead, the messaging kernel passes messages over VME bus 120 to accomplish all necessary inter-processor communication. Message passing is preferred over remote procedure calls for reasons of simplicity and speed.

Messages passed by the messaging kernel have a fixed 128-byte length. Within a single processor, messages are sent by reference; between processors, they are copied by the messaging kernel and then delivered to the destination process by reference. The processors of FIG. 2 have special hardware, discussed below, that can expediently exchange and buffer interprocessor messaging kernel messages.

The LNFS Local NFS interface

The 22-function NFS standard was specifically designed for stateless operation using unreliable communication. This means that neither clients nor server can be sure if they hear each other when they talk (unreliability). In practice, and in an Ethernet environment, this works well.

Within the server 100, however, NFS level datagrams are also used for communication between processors, in particular between the network controllers 110 and the file controller 112, and between the host processor 118 and the file controller 112. For this internal communication to be both efficient and convenient, it is undesirable and impractical to have complete statelessness or unreliable communications. Consequently, a modified form of NFS, namely LNFS, is used for internal communication of NFS requests and responses. LNFS is used only within the file server 100; the external network protocol supported by the server is precisely standard, licensed NFS. LNFS is described in more detail below.

The Network Controllers 110 each run an NFS server which, after all protocol processing is done up to the NFS layer, converts between external NFS requests and responses and internal LNFS requests and responses. For example, NFS requests arrive as RPC requests with XDR and enclosed in a UDP datagram. After protocol processing, the NFS server translates the NFS request into LNFS form and uses the messaging kernel to send the request to the file controller 112.

The file controller runs an LNFS server which handles LNFS requests both from network controllers and from the host 118. The LNFS server translates LNFS requests to a form appropriate for a file system server, also running on the file controller, which manages the system memory file data cache through a block I/O layer.

An overview of the software in each of the processors will now be set forth.

Network Controller 110

The optimized dataflow of the server 100 begins with the intelligent network controller 110. This processor receives Ethernet packets from client workstations. It quickly identifies NFS-destined packets and then performs full protocol processing on them to the NFS level, passing the resulting LNFS requests directly to the file controller 112. This protocol processing includes IP routing and reassembly, UDP demultiplexing, XDR decoding, and NFS request dispatching. The reverse steps are used to send an NFS reply back to a client. Importantly, these time-consuming activities are performed directly in the Network Controller 110, not in the host 118.

The server 100 uses conventional NFS ported from Sun Microsystems, Inc., Mountain View, Calif., and is NFS protocol compatible.

Non-NFS network traffic is passed directly to its destination host processor 118.

The NCs 110 also perform their own IP routing. Each network controller 110 supports two fully parallel Ethernets. There are four network controllers in the embodiment of the server 100 shown in FIG. 2, so that server can support up to eight Ethernets. For the two Ethernets on the same network controller 110, IP routing occurs completely within the network controller and generates no backplane traffic. Thus attaching two mutually active Ethernets to the same controller not only minimizes their inter-net transit time, but also significantly reduces backplane contention on the VME bus 120. Routing table updates are distributed to the network controllers from the host processor 118, which runs either the gated or routed Unix demon.

While the network controller described here is designed for Ethernet LANs, it will be understood that the invention can be used just as readily with other network types, including FDDI.

File Controller 112

In addition to dedicating a separate processor for NFS protocol processing and IP routing, the server 100 also dedicates a separate processor, the intelligent file controller 112, to be responsible for all file system processing. It uses conventional Berkeley Unix 4.3 file system code and uses a binary-compatible data representation on disk. These two choices allow all standard file system utilities (particularly block-level tools) to run unchanged.

The file controller 112 runs the shared file system used by all NCs 110 and the host processor 118. Both the NCs and the host processor communicate with the file controller 112 using the LNFS interface. The NCs 110 use LNFS as described above, while the host processor 118 uses LNFS as a plug-in module to SunOS's standard Virtual File System ("VFS") interface.

When an NC receives an NFS read request from a client workstation, the resulting LNFS request passes to the FC 112. The FC 112 first searches the system memory 116 buffer cache for the requested data. If found, a reference to the buffer is returned to the NC 110. If not found, the LRU (least recently used) cache buffer in system memory 116 is freed and reassigned for the requested block. The FC then directs the SP 114 to read the block into the cache buffer from a disk drive array. When complete, the SP so notifies the FC, which in turn notifies the NC 110. The NC 110 then sends an NFS reply, with the data from the buffer, back to the NFS client workstation out on the network. Note that the SP 114 transfers the data into system memory 116, if necessary, and the NC 110 transfers the data from system memory 116 to the networks. The process takes place without any involvement of the host 118.
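The read path just described (cache lookup by the FC, LRU buffer reassignment on a miss, a disk read by the SP, and a reply built by the NC from the cached buffer) can be illustrated with a minimal sketch. This is only an illustration in Python under stated assumptions, not the implementation of the described server: the class names `StorageProcessor` and `FileController`, the function `nc_handle_read`, the dictionary standing in for the disk array, and the cache capacity are all hypothetical.

```python
from collections import OrderedDict


class StorageProcessor:
    """Hypothetical stand-in for SP 114: reads blocks from a disk array."""

    def __init__(self, disk):
        self.disk = disk      # assumed: dict mapping block id -> bytes
        self.reads = 0        # counts disk reads, for illustration only

    def read_block(self, block_id):
        self.reads += 1
        return self.disk[block_id]


class FileController:
    """Hypothetical stand-in for FC 112 managing the buffer cache
    that the text places in system memory 116."""

    def __init__(self, sp, capacity):
        self.sp = sp
        self.capacity = capacity
        self.cache = OrderedDict()  # ordered least recently used first

    def read(self, block_id):
        # Cache hit: mark the buffer as most recently used and return it.
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        # Cache miss: free the least recently used buffer if the cache is full.
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        # Direct the SP to read the block into the reassigned buffer.
        data = self.sp.read_block(block_id)
        self.cache[block_id] = data
        return data


def nc_handle_read(fc, block_id):
    """Hypothetical NC 110 step: pass the request to the FC and build
    a reply from the returned buffer. The host is never involved."""
    return {"status": "ok", "data": fc.read(block_id)}
```

In this sketch a repeated read of the same block is served from the cache without a second call to `read_block`, which mirrors the case in which the FC returns a buffer reference directly to the NC.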
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Storage Processor

The intelligent storage processor 114 manages all disk and tape storage operations. While autonomous, storage processors are primarily directed by the file controller 112 to move file data between system memory 116 and the disk subsystem. The exclusion of both the host 118 and the FC 112 from the actual data path helps to supply the performance needed to service many remote clients.

Additionally, coordinated by a Server Manager in the host 118, storage processor 114 can execute server backup by moving data between the disk subsystem and tape or other archival peripherals on the SCSI channels. Further, if directly accessed by host processor 118, SP 114 can provide a much higher performance conventional disk interface for Unix, virtual memory, and databases. In Unix nomenclature, the host processor 118 can mount boot, storage swap, and raw partitions via the storage processors 114.

Each storage processor 114 operates ten parallel, fully synchronous SCSI channels (busses) simultaneously. Nine of these channels support three arrays of nine SCSI disk drives each, each drive in an array being assigned to a different SCSI channel. The tenth SCSI channel hosts up to seven tape and other SCSI peripherals. In addition to performing reads and writes, SP 114 performs device-level optimizations such as disk seek queue sorting, directs device error recovery, and controls DMA transfers between the devices and system memory 116.

Host Processor 118

The local host 118 has three main purposes: to run Unix, to provide standard ONC network services for clients, and to run a Server Manager. Since Unix and ONC are ported from the standard SunOS Release 4 and ONC Services Release 2, the server 100 can provide identically compatible high-level ONC services such as the Yellow Pages, Lock Manager, DES Key Authenticator, Auto Mounter, and Port Mapper. Sun/2 Network disk booting and more general IP internet services such as Telnet, FTP, SMTP, SNMP, and reverse ARP are also supported. Finally, print spoolers and similar Unix demons operate transparently.

The host processor 118 runs the following software modules:

TCP and Socket Layers

The Transport Control Protocol ("TCP"), which is used for certain server functions other than NFS, provides reliable bytestream communication between two processors. Sockets are used to establish TCP connections.

VFS Interface

The Virtual File System ("VFS") interface is a standard SunOS file system interface. It paints a uniform file-system picture for both users and the non-file parts of the Unix operating system, hiding the details of the specific file system. Thus standard NFS, LNFS, and any local Unix file system can coexist harmoniously.

UFS Interface

The Unix File System ("UFS") interface is the traditional and well-known Unix interface for communication with local-to-the-processor disk drives. In the server 100, it is used to occasionally mount storage processor volumes directly, without going through the file controller 112. Normally, the host 118 uses LNFS and goes through the file controller.

Device Layer

The device layer is a standard software interface between the Unix device model and different physical device implementations. In the server 100, disk devices are not attached to host processors directly, so the disk driver in the host's device layer uses the messaging kernel to communicate with the storage processor 114.

Route and Port Mapper Demons

The Route and Port Mapper demons are Unix user-level background processes that maintain the Route and Port databases for packet routing. They are mostly inactive and not in any performance path.

Yellow Pages and Authentication Demon

The Yellow Pages and Authentication services are Sun-ONC standard network services. Yellow Pages is a widely used multipurpose name-to-name directory lookup service. The Authentication service uses cryptographic keys to authenticate, or validate, requests to insure that requestors have the proper privileges for any actions or data they desire.

Server Manager

The Server Manager is an administrative application suite that controls configuration, logs error and performance reports, and provides a monitoring and tuning interface for the system administrator. These functions can be exercised from either system console connected to the host 118, or from a system administrator's workstation.

The host processor 118 is a conventional OEM Sun central processor card, Model 3E/120. It incorporates a Motorola 68020 microprocessor and 4 MB of on-board memory. Other processors, such as a SPARC-based processor, are also possible.

The structure and operation of each of the hardware components of server 100 will now be described in detail.

NETWORK CONTROLLER HARDWARE ARCHITECTURE

FIG. 3 is a block diagram showing the data path and some control paths for an illustrative one of the network controllers 110a. It comprises a 20 MHz 68020 microprocessor 210 connected to a 32-bit microprocessor data bus 212. Also connected to the microprocessor data bus 212 is a 256 K byte CPU memory 214. The low order 8 bits of the microprocessor data bus 212 are connected through a bidirectional buffer 216 to an 8-bit slow-speed data bus 218. On the slow-speed data bus 218 is a 128 K byte EPROM 220, a 32 byte PROM 222, and a multi-function peripheral (MFP) 224. The EPROM 220 contains boot code for the network controller 110a, while the PROM 222 stores various operating parameters such as the Ethernet addresses assigned to each of the two Ethernet interfaces on the board. Ethernet address information is read into the corresponding interface control block in the CPU memory 214 during initialization. The MFP 224 is a Motorola 68901, and performs various local functions such as timing, interrupts, and general purpose I/O. The MFP 224 also includes a UART for interfacing to an RS232 port 226. These functions are not critical to the invention and will not be further described herein.

The low order 16 bits of the microprocessor data bus 212 are also coupled through a bidirectional buffer 230 to a 16-bit LAN data bus 232. A LAN controller chip 234, such as the Am7990 LANCE Ethernet controller manufactured by Advanced Micro Devices, Inc., Sunnyvale, Calif., interfaces the LAN data bus 232 with the
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first Ethernet 122a shown in FIG. 2. Control and data for the LAN controller 234 are stored in a 512 K byte LAN memory 236, which is also connected to the LAN data bus 232. A specialized 16 to 32 bit FIFO chip 240, referred to herein as a parity FIFO chip and described below, is also connected to the LAN data bus 232. Also connected to the LAN data bus 232 is a LAN DMA controller 242, which controls movements of packets of data between the LAN memory 236 and the FIFO chip 240. The LAN DMA controller 242 may be a Motorola M68440 DMA controller using channel zero only.
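The role of the "16 to 32 bit" FIFO between the 16-bit LAN data bus and the wider bus on its other side can be illustrated with a short sketch. It is a simplification under stated assumptions: the class and function names are hypothetical, parity handling is left out, and the big-endian pairing of two 16-bit words into one 32-bit word is assumed rather than taken from the text.

```python
from collections import deque


class WidthConvertingFifo:
    """Hypothetical stand-in for a 16-to-32-bit FIFO (parity logic omitted)."""

    def __init__(self):
        self._words32 = deque()   # completed 32-bit words awaiting the wide port
        self._pending_hi = None   # first 16-bit half of the word being assembled

    def write_port_a(self, word16):
        # The narrow port accepts one 16-bit word per access.
        if not 0 <= word16 <= 0xFFFF:
            raise ValueError("the narrow port accepts 16-bit values only")
        if self._pending_hi is None:
            self._pending_hi = word16
        else:
            # Assumed ordering: first word written becomes the high half.
            self._words32.append((self._pending_hi << 16) | word16)
            self._pending_hi = None

    def read_port_b(self):
        # The wide port delivers one assembled 32-bit word per access.
        return self._words32.popleft()


def lan_dma_copy(lan_memory, start, length, fifo):
    """Move `length` bytes from LAN memory into the FIFO, 16 bits at a time."""
    if length % 2:
        raise ValueError("this sketch only handles an even number of bytes")
    for offset in range(start, start + length, 2):
        fifo.write_port_a(int.from_bytes(lan_memory[offset:offset + 2], "big"))
```

For example, copying the four bytes `12 34 56 78` through the FIFO yields the single 32-bit word `0x12345678` on the wide side.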

The second Ethernet 122b shown in FIG. 2 connects to a second LAN data bus 252 on the network controller card 110a shown in FIG. 3. The LAN data bus 252 connects to the low order 16 bits of the microprocessor data bus 212 via a bidirectional buffer 250, and has similar components to those appearing on the LAN data bus 232. In particular, a LAN controller 254 interfaces the LAN data bus 252 with the Ethernet 122b, using LAN memory 256 for data and control, and a LAN DMA controller 262 controls DMA transfer of data between the LAN memory 256 and the 16-bit wide data port A of the parity FIFO 260.

The low order 16 bits of microprocessor data bus 212 are also connected directly to another parity FIFO 270, and also to a control port of a VME/FIFO DMA controller 272. The FIFO 270 is used for passing messages between the CPU memory 214 and one of the remote boards 110, 112, 114, 116 or 118 (FIG. 2) in a manner described below. The VME/FIFO DMA controller 272, which supports three round-robin non-prioritized channels for copying data, controls all data transfers between one of the remote boards and any of the FIFOs 240, 260 or 270, as well as between the FIFOs 240 and 260.

32-bit data bus 274, which is connected to the 32-bit port B of each of the FIFOs 240, 260 and 270, is the data bus over which these transfers take place. Data bus 274 communicates with a local 32-bit bus 276 via a bidirectional pipelining latch 278, which is also controlled by VME/FIFO DMA controller 272, which in turn communicates with the VME bus 120 via a bidirectional buffer 280.

The local data bus 276 is also connected to a set of control registers 282, which are directly addressable across the VME bus 120. The registers 282 are used mostly for system initialization and diagnostics.

The local data bus 276 is also coupled to the microprocessor data bus 212 via a bidirectional buffer 284. When the NC 110a operates in slave mode, the CPU memory 214 is directly addressable from VME bus 120. One of the remote boards can copy data directly from the CPU memory 214 via the bidirectional buffer 284. LAN memories 236 and 256 are not directly addressed over VME bus 120.

The parity FIFOs 240, 260 and 270 each consist of an ASIC, the functions and operation of which are described in the Appendix. The FIFOs 240 and 260 are configured for packet data transfer and the FIFO 270 is configured for message passing. Referring to the Appendix, the FIFOs 240 and 260 are programmed with the following bit settings in the Data Transfer Configuration Register:

Bit  Definition                        Setting
0    WD Mode                           N/A
1    Parity Chip                       N/A
2    Parity Correct Mode               N/A
3    8/16 bits CPU & PortA interface   [illegible]
4    Invert Port A address 0           [illegible]
5    Invert Port A address 1           [illegible]
6    [illegible]                       [illegible]
7    Reset                             no (0)

The Data Transfer Control Register is programmed as follows:

Bit  Definition             Setting
0    Enable PortA Req/Ack   yes (1)
1    Enable PortB Req/Ack   yes (1)
2    [illegible]            [illegible]
3    CPU parity enable      [illegible]
4    PortA parity enable    no (0)
5    PortB parity enable    no (0)
6    [illegible]            [illegible]
7    PortA Master           [illegible]

Unlike the configuration used on FIFOs 240 and 260, the microprocessor 210 is responsible for loading and unloading Port A directly. The microprocessor 210 reads an entire 32-bit word from port A with a single instruction using two port A access cycles. Port A data transfer is disabled by unsetting bits 0 (Enable PortA Req/Ack) and 7 (PortA Master) of the Data Transfer Control Register.

The remainder of the control settings in FIFO 270 are the same as those in FIFOs 240 and 260 described above.

The NC 110a also includes a command FIFO 290. The command FIFO 290 includes an input port coupled to the local data bus 276, and which is directly addressable across the VME bus 120, and includes an output port connected to the microprocessor data bus 212. As explained in more detail below, when one of the remote boards issues a command or response to the NC 110a, it does so by directly writing a 1-word (32-bit) message descriptor into NC 110a's command FIFO 290. Command FIFO 290 generates a "FIFO not empty" status to the microprocessor 210, which then reads the message descriptor off the top of FIFO 290 and processes it. If the message is a command, then it includes a VME address at which the message is located (presumably an address in a shared memory similar to 214 on one of the remote boards). The microprocessor 210 then programs the FIFO 270 and the VME/FIFO DMA controller 272 to copy the message from the remote location into the CPU memory 214.

Command FIFO 290 is a conventional two-port FIFO, except that additional circuitry is included for generating a Bus Error signal on VME bus 120 if an attempt is made to write to the data input port while the FIFO is full. Command FIFO 290 has space for 256 entries.

A noteworthy feature of the architecture of NC 110a is that the LAN buses 232 and 252 are independent of the microprocessor data bus 212. Data packets being routed to or from an Ethernet are stored in LAN memory 236 on the LAN data bus 232 (or 256 on the LAN data bus 252), and not in the CPU memory 214. Data transfers between the LAN memories 236 and 256 and the Ethernets 122a and 122b are controlled by LAN controllers 234 and 254, respectively, while most data transfer between LAN memory 236 or 256 and a remote
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port on the VME bus 120 are controlled by LAN DMA controllers 242 and 262, FIFOs 240 and 260, and VME/FIFO DMA controller 272. An exception to this rule occurs when the size of the data transfer is small, e.g., less than 64 bytes, in which case microprocessor 210 copies it directly without using DMA. The microprocessor 210 is not involved in larger transfers except in initiating them and in receiving notification when they are complete.

The CPU memory 214 contains mostly instructions for microprocessor 210, messages being transmitted to or from a remote board via FIFO 270, and various data blocks for controlling the FIFOs, the DMA controllers and the LAN controllers. The microprocessor 210 accesses the data packets in the LAN memories 236 and 256 by directly addressing them through the bidirectional buffers 230 and 250, respectively, for protocol processing. The local high-speed static RAM in CPU memory 214 can therefore provide zero wait state memory access for microprocessor 210 independent of network traffic. This is in sharp contrast to the prior art architecture shown in FIG. 1, in which all data and data packets, as well as microprocessor instructions for host CPU card 10, reside in the memory 16 and must communicate with the host CPU card 10 via the MMU 11.

While the LAN data buses 232 and 252 are shown as separate buses in FIG. 3, it will be understood that they may instead be implemented as a single combined bus.

NETWORK CONTROLLER OPERATION

In operation, when one of the LAN controllers (such as 234) receives a packet of information over its Ethernet 122a, it reads in the entire packet and stores it in corresponding LAN memory 236. The LAN controller 234 then issues an interrupt to microprocessor 210 via MFP 224, and the microprocessor 210 examines the status register on LAN controller 234 (via bidirectional buffer 230) to determine that the event causing the interrupt was a "receive packet completed." In order to avoid a potential lockout of the second Ethernet 122b caused by the prioritized interrupt handling characteristic of MFP 224, the microprocessor 210 does not at this time immediately process the received packet; instead, such processing is scheduled for a polling function.

When the polling function reaches the processing of the received packet, control over the packet is passed to a software link level receive module. The link level receive module then decodes the packet according to either of two different frame formats: standard Ethernet format or SNAP (IEEE 802 LLC) format. An entry in the header in the packet specifies which frame format was used. The link level driver then determines which of three types of messages is contained in the received packet: (1) IP, (2) ARP packets which can be handled by a local ARP module, or (3) ARP packets and other packet types which must be forwarded to the local host 118 (FIG. 2) for processing. If the packet is an ARP packet which can be handled by the NC 110a, such as a request for the address of server 100, then the microprocessor 210 assembles a response packet in LAN memory 236 and, in a conventional manner, causes LAN controller 234 to transmit that packet back over Ethernet 122a. It is noteworthy that the data manipulation for accomplishing this task is performed almost completely in LAN memory 236, directly addressed by microprocessor 210 as controlled by instructions in CPU memory 214. The function is accomplished also without generating any traffic on the VME backplane 120 at all, and without disturbing the local host 118.

If the received packet is either an ARP packet which cannot be processed completely in the NC 110a, or is another type of packet which requires delivery to the local host 118 (such as a client request for the server 100 to execute a client-defined procedure), then the microprocessor 210 programs LAN DMA controller 242 to load the packet from LAN memory 236 into FIFO 240, programs FIFO 240 with the direction of data transfer, and programs DMA controller 272 to read the packet out of FIFO 240 and across the VME bus 120 into system memory 116. In particular, the microprocessor 210 first programs the LAN DMA controller 242 with the starting address and length of the packet in LAN memory 236, and programs the controller to begin transferring data from the LAN memory 236 to port A of parity FIFO 240 as soon as the FIFO is ready to receive data. Second, microprocessor 210 programs the VME/FIFO DMA controller 272 with the destination address in system memory 116 and the length of the data packet, and instructs the controller to begin transferring data from port B of the FIFO 240 onto VME bus 120. Third, microprocessor 210 programs FIFO 240 with the direction of the transfer to take place. The transfer then proceeds entirely under the control of DMA controllers 242 and 272, without any further involvement by microprocessor 210.

The microprocessor 210 then sends a message to host 118 that a packet is available at a specified system memory address. The microprocessor 210 sends such a message by writing a message descriptor to a software-emulated command FIFO on the host, which copies the message from CPU memory 214 on the NC via buffer 284 and into the host's local memory, in ordinary VME transfer mode. The host then copies the packet from system memory 116 into the host's own local memory using ordinary VME transfers.

If the packet received by NC 110a from the network is an IP packet, then the microprocessor 210 determines whether it is (1) an IP packet for the server 100 which is not an NFS packet; (2) an IP packet to be routed to a different network; or (3) an NFS packet. If it is an IP packet for the server 100, but not an NFS packet, then the microprocessor 210 causes the packet to be transmitted from the LAN memory 236 to the host 118 in the same manner described above with respect to certain ARP packets.

If the IP packet is not intended for the server 100, but rather is to be routed to a client on a different network, then the packet is copied into the LAN memory associated with the Ethernet to which the destination client is connected. If the destination client is on the Ethernet 122b, which is on the same NC board as the source Ethernet 122a, then the microprocessor 210 causes the packet to be copied from LAN memory 236 into LAN memory 256 and then causes LAN controller 254 to transmit it over Ethernet 122b. (Of course, if the two LAN data buses 232 and 252 are combined, then copying would be unnecessary; the microprocessor 210 would simply cause the LAN controller 254 to read the packet out of the same locations in LAN memory to which the packet was written by LAN controller 234.)

The copying of a packet from LAN memory 236 to LAN memory 256 takes place similarly to the copying described above from LAN memory to system memory. For transfer sizes of 64 bytes or more, the microprocessor 210 first programs the LAN DMA controller
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242 with the starting address and length of the packet in LAN memory 236, and programs the controller to begin transferring data from LAN memory 236 into port A of parity FIFO 240 as soon as the FIFO is ready to receive data. Second, microprocessor 210 programs the LAN DMA controller 262 with a destination address in LAN memory 256 and the length of the data packet, and instructs that controller to transfer data from parity FIFO 260 into the LAN memory 256. Third, microprocessor 210 programs the VME/FIFO DMA controller 272 to transfer words of data out of port B of the FIFO 240, over the data bus 274, and into port B of FIFO 260. Finally, the microprocessor 210 programs the two FIFOs 240 and 260 with the direction of the transfer to take place. The transfer then proceeds entirely under the control of DMA controllers 242, 262 and 272, without any further involvement by the microprocessor 210. Like the copying from LAN memory to system memory, if the transfer size is smaller than 64 bytes, the microprocessor 210 performs the transfer directly, without DMA.

When each of the LAN DMA controllers 242 and 262 complete their work, they so notify microprocessor 210 by a respective interrupt provided through MFP 224. When the microprocessor 210 has received both interrupts, it programs LAN controller 254 to transmit the packet on the Ethernet 122b in a conventional manner.

Thus, IP routing between the two Ethernets in a single network controller 110 takes place over data bus 274, generating no traffic and requiring no bandwidth on the VME bus 120. Nor is the host processor 118 disturbed for such routing, in contrast to the prior art architecture of FIG. 1. Moreover, all but the shortest copying work is performed by controllers outside microprocessor 210, requiring the involvement of the microprocessor 210, and bus traffic on microprocessor data bus 212, only for the supervisory functions of programming the DMA controllers and the parity FIFOs and instructing them to begin. The VME/FIFO DMA controller 272 is programmed by loading control registers via microprocessor data bus 212; the LAN DMA controllers 242 and 262 are programmed by loading control registers on the respective controllers via the microprocessor data bus 212, respective bidirectional buffers 230 and 250, and respective LAN data buses 232 and 252; and the parity FIFOs 240 and 260 are programmed as set forth in the Appendix.

If the destination workstation of the IP packet to be routed is on an Ethernet connected to a different one of the network controllers 110, then the packet is copied into the appropriate LAN memory on the NC 110 to which that Ethernet is connected. Such copying is accomplished by first copying the packet into system memory 116, in the manner described above with respect to certain ARP packets, and then notifying the destination NC 110 that a packet is available. When an NC is so notified, it programs its own parity FIFO and DMA controllers to copy the packet from system memory 116 into the appropriate LAN memory. It is noteworthy that though this type of IP routing does create VME bus traffic, it still does not involve the host CPU 118.

If the IP packet received over the Ethernet 122a and now stored in LAN memory 236 is an NFS packet intended for the server 100, then the microprocessor 210 performs all necessary protocol preprocessing to extract the NFS message and convert it to the local NFS (LNFS) format. This may well involve the logical concatenation of data extracted from a large number of individual IP packets stored in LAN memory 236, resulting in a linked list, in CPU memory 214, pointing to the different blocks of data in LAN memory 236 in the correct sequence.

The exact details of the LNFS format are not important for an understanding of the invention, except to note that it includes commands to maintain a directory of files which are stored on the disks attached to the storage processors 114, commands for reading and writing data to and from a file on the disks, and various configuration management and diagnostics control messages. The directory maintenance commands which are supported by LNFS include the following messages based on conventional NFS: get attributes of a file (GETATTR); set attributes of a file (SETATTR); look up a file (LOOKUP); create a file (CREATE); remove a file (REMOVE); rename a file (RENAME); create a new linked file (LINK); create a symlink (SYMLINK); remove a directory (RMDIR); and return file system statistics (STATFS). The data transfer commands supported by LNFS include read from a file (READ); write to a file (WRITE); read from a directory (READDIR); and read a link (READLINK). LNFS also supports a buffer release command (RELEASE) for notifying the file controller that an NC is finished using a specified buffer in system memory. It also supports a VOP-derived access command, for determining whether a given type of access is legal for a specified credential on a specified file.

If the LNFS request includes the writing of file data from the LAN memory 236 to disk, the NC 110a first requests a buffer in system memory 116 to be allocated by the appropriate FC 112. When a pointer to the buffer is returned, microprocessor 210 programs LAN DMA controller 242, parity FIFO 240 and VME/FIFO DMA controller 272 to transmit the entire block of file data to system memory 116. The only difference between this transfer and the transfer described above for transmitting IP packets and ARP packets to system memory 116 is that these data blocks will typically have portions scattered throughout LAN memory 236. The microprocessor 210 accommodates that situation by programming LAN DMA controller 242 successively for each portion of the data, in accordance with the linked list, after receiving notification that the previous portion is complete. The microprocessor 210 can program the parity FIFO 240 and the VME/FIFO DMA controller 272 once for the entire message, as long as the entire data block is to be placed contiguously in system memory 116. If it is not, then the microprocessor 210 can program the DMA controller 272 for successive blocks in the same manner as LAN DMA controller 242.

If the network controller 110a receives a message from another processor in server 100, usually from file controller 112, that file data is available in system memory 116 for transmission on one of the Ethernets, for example Ethernet 122a, then the network controller 110a copies the file data into LAN memory 236 in a manner similar to the copying of file data in the opposite direction. In particular, the microprocessor 210 first programs VME/FIFO DMA controller 272 with the starting address and length of the data in system memory 116 and programs the controller to begin transferring data over the VME bus 120 into port B of parity FIFO 240 as soon as the FIFO is ready to receive data. The microprocessor 210 then programs the LAN DMA controller 242 with a destination address in LAN mem-
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ory 236 and the length of the file data, and instructs that controller to transfer data from the parity FIFO 240 into the LAN memory 236. Third, microprocessor 210 programs the parity FIFO 240 with the direction of the transfer to take place. The transfer then proceeds entirely under the control of DMA controllers 242 and 272, without any further involvement by the microprocessor 210. Again, if the file data is scattered in [illegible].

When each of the DMA controllers 242 and 272 complete their work, they so notify microprocessor 210 through MFP 224. The microprocessor 210 then performs all necessary protocol processing on the LNFS message in LAN memory 236 in order to prepare the message for transmission over the Ethernet 122a in the form of Ethernet IP packets. As set forth above, this protocol processing is performed entirely in network controller 110a, without any involvement of the local host 118.

It should be noted that the parity FIFOs are designed to move multiples of 128-byte blocks most efficiently. The data transfer size through port B is always 32 bits wide, and the VME address corresponding to the 32-bit data must be quad-byte aligned. The data transfer size for port A can be either 8 or 16 bits. For bus utilization reasons, it is set to 16 bits when the corresponding local start address is double-byte aligned, and is set at 8 bits otherwise. The TCP/IP checksum is always computed in the 16-bit mode. Therefore, the checksum word requires byte swapping if the local start address is not double-byte aligned.

Accordingly, for transfer from port B to port A of any of the FIFOs 240, 260 or 270, the microprocessor 210 programs the VME/FIFO DMA controller to pad the transfer count to the next 128-byte boundary. The extra 32-bit word transfers do not involve the VME bus, and only the desired number of 32-bit words will be unloaded from port A.

For transfers from port A to port B of the parity FIFO 270, the microprocessor 210 loads port A word-by-word and forces a FIFO full indication when it is finished. The FIFO full indication enables unloading from port B. The same procedure also takes place for transfers from port A to port B of either of the parity FIFOs 240 or 260, since transfers of fewer than 128 bytes are performed under local microprocessor control rather than under the control of LAN DMA controller 242 or 262. For all of the FIFOs, the VME/FIFO DMA controller is programmed to unload only the desired number of 32-bit words.

FILE CONTROLLER HARDWARE ARCHITECTURE

The file controllers (FCs) 112 may each be a standard off-the-shelf microprocessor board, such as one manufactured by [illegible]. Preferably, however, a more specialized board is used such as that shown in block diagram form in FIG. 4.

FIG. 4 shows one of the FCs 112a, and it will be understood that the other FC can be identical. In many respects it is simply a scaled-down version of the NC 110a shown in FIG. 3, and in some respects it is scaled up. Like the NC 110a, FC 112a comprises a 20 MHz 68020 microprocessor 310 connected to a 32-bit microprocessor data bus 312. Also connected to the microprocessor data bus 312 is a 256 K byte shared CPU memory 314. The low order 8 bits of the microprocessor data bus 312 are connected through a bidirectional buffer 316 to an 8-bit slow-speed data bus 318. On the slow-speed data bus 318 are a 128 K byte PROM 320 and a multifunction peripheral (MFP) 324. The functions of the PROM 320 and MFP 324 are the same as those described above with respect to EPROM 220 and MFP 224. [illegible] The parallel port 392 is mainly for testing and diagnostics.

Like the NC 110a, the FC 112a is connected to the VME bus 120 via a bidirectional buffer 380 and a 32-bit local data bus 376. A set of control registers 382 are connected to the local data bus 376 and directly addressable across the VME bus 120. The local data bus 376 is also coupled to the microprocessor data bus 312 via a bidirectional buffer 384. This permits the direct addressability of CPU memory 314 from VME bus 120.

FC 112a also includes a command FIFO 390, which includes an input port coupled to the local data bus 376 and which is directly addressable across the VME bus 120. The command FIFO 390 also includes an output port connected to the microprocessor data bus 312. The structure, operation and purpose of command FIFO 390 are the same as those described above with respect to command FIFO 290 on NC 110a.

The FC 112a omits the LAN data buses 232 and 252 which are present in NC 110a, but instead includes a 4 megabyte 32-bit wide FC memory 396 coupled to the microprocessor data bus 312 via a bidirectional buffer 394. As will be seen, FC memory 396 is used as a cache memory for file control information, separate from the file data information cached in system memory 116.

The file controller embodiment shown in FIG. 4 does not include any DMA controllers, and hence cannot act as a master for transmitting or receiving data in any block transfer mode, over the VME bus 120. Block transfers do occur with the CPU memory 314 and the FC memory 396, however, with the FC 112a acting as a VME bus slave. In such transfers, the remote master addresses the CPU memory 314 or the FC memory 396 directly over the VME bus 120 through the bidirectional buffers 384 and, if appropriate, 394.

FILE CONTROLLER OPERATION

The purpose of the FC 112a is basically to provide virtual file system services in response to requests provided in LNFS format by remote processors on the VME bus 120. Most requests will come from a network controller 110, but requests may also come from the local host 118.

The file-related commands supported by LNFS are identified above. They are all specified to the FC 112a in terms of logically identified disk data blocks. For example, the LNFS command for reading data from a file includes a specification of the file from which to read (file system ID (FSID) and file ID (inode)), a byte offset, and a count of the number of bytes to read. The FC 112a converts that identification into physical form, namely disk and sector numbers, in order to satisfy the command.

The FC 112a runs a conventional Fast File System (FFS or UFS), which is based on the Berkeley 4.3 VAX release. This code performs the conversion and also performs all disk data caching and control data caching. However, as previously mentioned, control data cach-
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ing is performed using the FC memory 396 on FC 112a, whereas disk data caching is performed using the system memory 116 (FIG. 2). Caching this file control information within the FC 112a avoids the VME bus congestion and speed degradation which would result if file control information was cached in system memory 116.

The memory on the FC 112a is directly accessed over the VME bus 120 for three main purposes. First, and by far the most frequent, are accesses to FC memory 396 by an SP 114 to read or write cached file control information. These are accesses requested by FC 112a to write locally modified file control structures through to disk, or to read file control structures from disk. Second, the FC's CPU memory 314 is accessed directly by other processors for message transmissions from the FC 112a to such other processors. For example, if a data block in system memory is to be transferred to an SP 114 for writing to disk, the FC 112a first assembles a message in its local memory 314 requesting such a transfer. The FC 112a then notifies the SP 114, which copies the message directly from the CPU memory 314 and executes the requested transfer.

A third type of direct access to the FC's local memory occurs when an LNFS client reads directory entries. When FC 112a receives an LNFS request to read directory entries, the FC 112a formats the requested directory entries in FC memory 396 and notifies the requestor of their location. The requestor then directly accesses FC memory 396 to read the entries.

The version of the UFS code on FC 112a includes some modifications in order to separate the two caches. In particular, two sets of buffer headers are maintained, one for the FC memory 396 and one for the system memory 116. Additionally, a second set of the system buffer routines (GETBLK(), BRELSE(), BREAD(), BWRITE(), and BREADA()) exist, one for buffer accesses to FC Mem 396 and one for buffer accesses to system memory 116. The UFS code is further modified to call the appropriate buffer routines for FC memory 396 for accesses to file control information, and to call

the FC 112a. Since FC 112a employs write-through caching, when it receives the release command from the requestor, it instructs storage processor 114 to copy the data from system memory 116 onto the appropriate disk sectors before freeing the system memory buffers for possible reallocation.

The READDIR transaction is similar to read and write, but the request is satisfied by the FC 112a directly out of its own FC memory 396 after formatting the requested directory information specifically for this purpose. The FC 112a causes the storage processor to read the requested directory information from disk if it is not already locally cached. Also, the specified offset is a "magic cookie" instead of a byte offset, identifying directory entries instead of an absolute byte offset into the file. No file attributes are returned.

The READLINK transaction also returns no file attributes, and since links are always read in their entirety, it does not require any offset or count.

For all of the disk data caching performed through system memory 116, the FC 112a acts as a central authority for dynamically allocating, deallocating and keeping track of buffers. If there are two or more FCs 112, each has exclusive control over its own assigned portion of system memory 116. In all of these transactions, the requested buffers are locked during the period between the initial request and the release request. This prevents corruption of the data by other clients.

Also in the situation where there are two or more FCs, each file system on the disks is assigned to a particular one of the FCs. FC #0 runs a process called FC_VICE_PRESIDENT, which maintains a list of which file systems are assigned to which FC. When a client processor (for example an NC 110) is about to make an LNFS request designating a particular file system, it first sends the fsid in a message to the FC_VICE_PRESIDENT asking which FC controls the specified file system. The FC_VICE_PRESIDENT responds, and the client processor sends the LNFS request to the designated FC. The client processor also maintains its own list of fsid/FC pairs as it
the appropnate buffer routines for the system memory discovers them, so as to minimize the number of such 
116 tor the caching of disk data. A description of UFS requests to the FC_ VICE-PRESIDENT 
may be found in chapters 2, 6, 7 and 8 of "Kernel Struc- 
ture and Flow," by Rieken and Webb of .sh consulting 45 STORAGE PROCESSOR HARDWARE 
(Santa Clara, Calif.: 1988), incorporated herein by refer- ARCHITECTURE 

. ^. In the file server 100, each of the storage processors 

When a read command is sent to the FC by a re- 114 can interface the VME bus 120 with up to 10 differ- 
■ questor such as a network controller, the FC first con- ent SCSI buses. Additionally, it can do so at the full 
verts the file, offset and count information mto disk and 50 usage rate of an enhanced block transfer protocol of 55 
sector mformation. It then locks the system memory MB per second. 

buffers which contain that information, instructing the FIG. 5 is a block diagram of one of the SPs 114a SP 
stora^procwsor 114 to read them from disk if neces- 1146 is identical. SP 114a comprises a microprocessor 
sary. When the buffer is ready, the FC returns a mes- 510. which may be a Motorola 68020 microprocessor 
sage to the requestor contaming both the attributes of 55 operating at 20 MHz. The microprocessor 510 is cou- 
the designated file and an array of buffer descriptors pled over a 32-bit microprocessor data bus 512 with 
that 'dentify the locations in system memory 116 hold- CPU memory 514, which may include up to 1 MB of 
* ^ ^ . static RAM. The microprocessor 510 accesses instruc- 

After the requestor has read the data out of the buff- tions, data and status on its own private bus 512, with no 
ers. It sends a release request back to the FC. The re- 60 contention from any other source. The microprocessor 
lease request is the same message that was returned by 510 is the only master of bus 512 
the FC in response to the read request; the FC 112a uses The low order 1 6 bits of the microprocessor data bus 
the information contained therem to determine which 512 interface with a control bus 516 via a bidirectional 
buffers to free. buffer 518. The low order 8 bits of the comrol bus 516 

A write command is processed by FC 112a similarly 65 interface with a slow speed bus 520 via another bidirec- 
to the read command, but the caller is expected to write tional buffer 522. The slow speed bus 520 connects to an 
to (instead of read from) the locations in system mem- MFP 524, similar to the MFP 224 in NC 110a (FIG. 3), 
ory 116 Identified by the buffer descriptors returned by and with a PROM 526, similar to PROM 220 on NC 



23 

110(3 The PROM 526 comprises 128 K bytes of 
EPROM which contains the functional code for SP 
114<i, Due to the width and speed of the EPROM 526, 
the functional code is copied to CPU memory 514 upon 
reset for faster execution. 

MFP 524, like the MFP 224 on NC 110a, comprises a Motorola 68901 multifunction peripheral device. It provides the functions of a vectored interrupt controller, individually programmable I/O pins, four timers and a UART. The UART functions provide serial communications across an RS 232 bus (not shown in FIG. 5) for debug monitors and diagnostics. Two of the four timing functions may be used as general-purpose timers by the microprocessor 510, either independently or in cascaded fashion. A third timer function provides the refresh clock for a DMA controller described below, and the fourth timer generates the UART clock. Additional information on the MFP 524 can be found in "MC 68901 Multi-Function Peripheral Specification," by Motorola, Inc., which is incorporated herein by reference.

The eight general-purpose I/O bits provided by MFP 524 are configured according to the following table:

[...] functions as an early warning
SCSI Attention - A composite of the SCSI Attentions from all 10 SCSI channels.
Channel Operation Done - A composite of the channel done bits from all 13 channels of the DMA controller.
DMA Controller Enable - Enables the DMA Controller to run.
VMEbus Interrupt Done - Indicates the completion of a VMEbus Interrupt.
Command Available - Indicates that the SP's Command Fifo, described below, contains one or more command pointers.
External Interrupts Disable - Disables externally generated interrupts to the microprocessor 510.
Command Fifo Enable - Enables operation of the SP's Command Fifo. Clears the Command Fifo when reset.

Commands are provided to the SP 114a from the VME bus 120 via a bidirectional buffer 530, a local data bus 532 and a command FIFO 534. The command FIFO 534 is similar to the command FIFOs 290 and 390 on NC 110a and FC 112a, respectively, and has a depth of 256 32-bit entries. The command FIFO 534 is a write-only register as seen on the VME bus 120, and a read-only register as seen by microprocessor 510. If the FIFO is full at the beginning of a write from the VME bus, a VME bus error is generated. Pointers are removed from the command FIFO 534 in the order received, and only by the microprocessor 510. Command available status is provided through I/O bit 4 of the MFP 524, and as long as one or more command pointers are still within the command FIFO 534, the command available status remains asserted.

As previously mentioned, the SP 114a supports up to 10 SCSI buses or channels 540a-540j. In the typical configuration, buses 540a-540i support up to 3 SCSI disk drives each, and channel 540j supports other SCSI peripherals such as tape drives, optical disks, and so on. Physically, the SP 114a connects to each of the SCSI buses with an ultra-miniature D sub connector and round shielded cables. Six 50-pin cables provide 300 conductors which carry 18 signals per bus and 12 grounds. The cables attach at the front panel of the SP 114a and to a commutator board at the disk drive array. Standard 50-pin cables connect each SCSI device to the commutator board. Termination resistors are installed on the SP 114a.

The SP 114a supports synchronous parallel data transfers up to 5 MB per second on each of the SCSI buses 540, arbitration, and disconnect/reconnect services. Each SCSI bus 540 is connected to a respective SCSI adaptor 542, which in the present embodiment is an AIC 6250 controller IC manufactured by Adaptec Inc., Milpitas, Calif., operating in the non-multiplexed address bus mode. The AIC 6250 is described in detail in "AIC-6250 Functional Specification," by Adaptec Inc., which is incorporated herein by reference. The SCSI adaptors 542 each provide the necessary hardware interface and low-level electrical protocol to implement its respective SCSI channel.

The 8-bit data port of each of the SCSI adaptors 542 is connected to port A of a respective one of a set of ten parity FIFOs 544a-544j. The FIFOs 544 are the same as FIFOs 240, 260 and 270 on NC 110a, and are connected and configured to provide parity covered data transfers between the 8-bit data port of the respective SCSI adaptors 542 and a 36-bit (32-bit plus 4 bits of parity) common data bus 550. The FIFOs 544 provide handshake, status, word assembly/disassembly and speed matching FIFO buffering for this purpose. The FIFOs 544 also generate and check parity for the 32-bit bus, and for RAID 5 implementations they accumulate and check redundant data and accumulate recovered data.

All of the SCSI adaptors 542 reside at a single location of the address space of the microprocessor 510, as do all of the parity FIFOs 544. The microprocessor 510 selects individual controllers and FIFOs for access in pairs, by first programming a pair select register (not shown) to point to the desired pair and then reading from or writing to the control register address of the desired chip in the pair. The microprocessor 510 communicates with the control registers on the SCSI adaptors 542 via the control bus 516 and an additional bidirectional buffer 546, and communicates with the control registers on FIFOs 544 via [...] bidirectional buffer 552. Both the SCSI adaptors 542 and FIFOs 544 employ 8-bit control registers, and register addressing of the FIFOs 544 is arranged such that such registers alias in consecutive byte locations. This allows the microprocessor 510 to write to the registers as a single 32-bit register, thereby reducing instruction overhead.

The parity FIFOs 544 are each configured in their Adaptec 6250 mode. Referring to the Appendix, the FIFOs 544 are programmed with the following bit settings in the Data Transfer Configuration Register:

WD Mode
Parity Chip
Parity Correct Mode
8/16 bits CPU & PortA interface
Invert Port A address 0
Invert Port A address 1
Checksum Carry Wrap

The Data Transfer Control Register is programmed as follows:

Enable PortA Req/Ack
Enable PortB Req/Ack
Data Transfer Direction
CPU parity enable
PortA parity enable
PortB parity enable
Checksum Enable
PortA Master



In addition, bit 4 of the RAM Access Control Register (Long Burst) is programmed for 8-byte bursts.

SCSI adaptors 542 each generate a respective interrupt signal, the status of which is provided to microprocessor 510 as 10 bits of a 16-bit SCSI interrupt register 556. The SCSI interrupt register 556 is connected to the control bus 516. Additionally, a composite SCSI interrupt is provided through the MFP 524 whenever any one of the SCSI adaptors 542 needs servicing.
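
As an illustration only, the relationship between the per-adaptor interrupt bits and the composite interrupt can be sketched in a few lines of Python. The text does not say which 10 bits of the 16-bit register are used or how they map to channels, so the sketch assumes that bit n of the low-order ten bits corresponds to SCSI adaptor n; the function names are hypothetical.

```python
NUM_SCSI_CHANNELS = 10  # ten SCSI adaptors 542, per the description


def pending_channels(interrupt_register):
    """Return the channel numbers whose interrupt bit is set.

    Assumption (not stated in the text): bit n of the low-order ten
    bits of the 16-bit register belongs to SCSI adaptor n.
    """
    value = interrupt_register & 0xFFFF  # the register is 16 bits wide
    return [ch for ch in range(NUM_SCSI_CHANNELS) if value & (1 << ch)]


def composite_interrupt(interrupt_register):
    """Model the composite interrupt: asserted if any adaptor needs service."""
    return bool(pending_channels(interrupt_register))
```

Under these assumptions a service routine would poll the register once after the composite interrupt fires and then service each listed channel in turn.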

An additional parity FIFO 554 is also provided in the SP 114a, for message passing. Again referring to the Appendix, the parity FIFO 554 is programmed with the following bit settings in the Data Transfer Configuration Register:

WD Mode
Parity Chip
Parity Correct Mode
8/16 bits CPU & PortA interface
Invert Port A address 0
Invert Port A address 1
Checksum Carry Wrap

The Data Transfer Control Register is programmed as follows:

Enable PortA Req/Ack
Enable PortB Req/Ack
Data Transfer Direction
CPU parity enable
PortA parity enable
PortB parity enable
Checksum Enable
PortA Master

In addition, bit 4 of the RAM Access Control Register (Long Burst) is programmed for 8-byte bursts.

Port A of the FIFO 554 is connected to the 16-bit control bus 516, and port B is connected to the common data bus 550. [...] microprocessor 510 can communicate directly with the VME bus 120, as is described in more detail below.

The microprocessor 510 manages data movement using a set of 15 channels, each of which has a unique [...] current state. Channels are implemented using a channel enable register 560 and a channel status register 562 [...]. The channel enable register 560 is a [...] write-only register, whereas the channel status register 562 is a [...] register. The two registers reside at the same address to microprocessor 510. The microprocessor 510 enables a particular channel by setting its respective bit in channel enable register 560, and recognizes completion of the specified operation by testing for a "done" bit in the channel status register 562. The microprocessor 510 then resets the enable bit, which causes the respective "done" bit in the channel status register 562 to be cleared.

The channels are defined as follows:

CHANNEL  FUNCTION

0:9      These channels control data movement [...] from the respective FIFOs 544 via the common data bus 550. When a FIFO is enabled and a request is received from it, the channel becomes ready. Once the channel has been serviced a status of done is generated.
11:10    These channels control data movement between a local data buffer 564, described below, and the VME bus 120. When enabled the channel becomes ready. Once the channel has been serviced a status of done is generated.
12       When enabled, this channel causes the DRAM in local data buffer 564 to be refreshed based on a clock which is generated by the MFP 524. The refresh consists of a burst of 16 rows. This channel does not generate a status of done.
13       [...] is serviced by this channel. When enable is set and the FIFO 554 asserts a request then the channel becomes ready. This channel generates a status of done.
14       Low latency writes from microprocessor 510 onto the VME bus 120 are controlled by this channel. When this channel is enabled data is [...] described below, onto the VME bus 120. This [...]

Channels are prioritized to allow servicing of the more critical requests first. Channel priority is assigned in a descending order starting at channel 14. That is, in the event that all channels are requesting service, channel 14 will be the first one served.

The common data bus 550 is coupled via a bidirectional register 570 to a 36-bit junction bus 572. A second bidirectional register 574 connects the junction bus 572 with the local data bus 532. Local data buffer 564, which comprises 1 MB of DRAM, with parity, is coupled bidirectionally to the junction bus 572. It is organized to provide 256 K 32-bit words with byte parity. [...] operates the DRAMs in page mode to support a very high data rate, which requires bursting of data instead of random single-word accesses. It will be seen that the local data buffer 564 is used to implement a RAID (redundant array of inexpensive disks) algorithm, and is not used for direct reading and writing between the VME bus 120 and a peripheral on one of the SCSI buses 540.

A read-only register 576, containing all zeros, is also connected to the junction bus 572. This register is used mostly for diagnostics, initialization, and clearing of large blocks of data in system memory 116.

The movement of data between the FIFOs 544 and 554, the local data buffer 564, and a remote entity such as the system memory 116 on the VME bus 120 is all controlled by a VME/FIFO DMA controller 580. The VME/FIFO DMA controller 580 is similar to the VME/FIFO DMA controller 272 on network controller 110a (FIG. 3), and is described in the Appendix. Briefly, it includes a bit slice engine 582 and a dual-port static RAM 584. One port of the dual-port static RAM 584 communicates over the 32-bit microprocessor data
bus 512 with microprocessor 510, and the other port communicates over a separate 16-bit bus with the bit slice engine 582. The microprocessor 510 places command parameters in the dual-port RAM 584, and uses the channel enables 560 to signal the VME/FIFO DMA controller 580 to proceed with the command. The VME/FIFO DMA controller is responsible for scanning the channel status and servicing requests, and [...] writes into the dual-port RAM 584 the desired command and associated parameters for the desired channel. For example, the command might be, copy a block of data from FIFO 544A out into a block of system memory 116 beginning at a specified VME address. Second, the microprocessor sets the channel enable bit in channel enable register 560 for the desired channel.

At the time the channel enable bit is set, the appropriate FIFO may not yet be ready to send data. Only when the VME/FIFO DMA controller 580 does receive a "ready" status from the channel, will the controller 580 execute the command. In the meantime, the DMA controller 580 is free to execute commands and move data to or from other channels.

When the DMA controller 580 does receive a status of "ready" from the specified channel, the controller fetches the channel command and parameters from the dual-ported RAM 584 and executes. When the command is complete, for example all the requested data has been copied, the DMA controller writes status back into the dual-port RAM 584 and asserts "done" for the channel in channel status register 562. The microprocessor 510 is then interrupted, at which time it reads channel status register 562 to determine which channel interrupted. The microprocessor 510 then clears the channel enable for the appropriate channel and checks the ending channel status in the dual-port RAM 584.

In this way a high-speed data transfer can take place under the control of DMA controller 580, fully in parallel with other activities being performed by microprocessor 510. The data transfer takes place over busses different from microprocessor data bus 512, thereby avoiding any interference with microprocessor instruction fetches.

The SP 114a also includes a high-speed register 590, which is coupled between the microprocessor data bus 512 and the local data bus 532. The high-speed register 590 is used to write a single 32-bit word to a VME bus target with a minimum of overhead. The register is write only as viewed from the microprocessor 510. In order to write a word onto the VME bus 120, the microprocessor 510 first writes the word into the register 590, and the desired VME target address into dual-port RAM 584. When the microprocessor 510 enables the appropriate channel in channel enable register 560, the DMA controller 580 transfers the data from the register 590 into the VME bus address specified in the dual-port RAM 584. The DMA controller 580 then writes the ending status to the dual-port RAM and sets the channel "done" bit in channel status register 562.

This procedure is very efficient for transfer of a single word of data, but becomes inefficient for large blocks of data. Transfers of greater than one word of data, typically for message passing, are usually performed using the FIFO 554.

The SP 114a also includes a series of registers 592, similar to the registers 282 on NC 110a (FIG. 3) and the registers 382 on FC 112a (FIG. 4). The details of these registers are not important for an understanding of the present invention.

STORAGE PROCESSOR OPERATION

[...] physical [...] arm contention. The tenth drive [...] RAID 5 (redundant [...] inexpensive drives, revision 5) is described in "A Case For a Redundant Arrays of Inexpensive Disks (RAID)", by Patterson et al., published at ACM SIGMOD Conference, Chicago, Ill., Jun. 1-3, 1988, incorporated herein by reference.

In RAID 5, disk data are divided into stripes. Data stripes are recorded sequentially on eight different disk drives. A ninth parity stripe, the exclusive-or of eight data stripes, is recorded on a ninth drive. If the stripe size is set to 8 K bytes, a read of 8 K of data involves only one drive. A write of 8 K of data involves two drives: a data drive and a parity drive. Since a write requires the reading back of old data to generate a new parity stripe, writes are also referred to as modify writes. The SP 114a supports nine small reads to nine SCSI drives concurrently. When stripe size is set to 8 K, a read of 64 K of data starts all eight SCSI drives, with each drive reading one 8 K stripe worth of data. The parallel operation is transparent to the caller client.

The parity stripes are rotated among the nine drives in order to avoid drive contention during write operations. The parity stripe is used to improve availability of data. When one drive is down, the SP 114a can reconstruct the missing data from a parity stripe. In such case, the SP 114a is running in error recovery mode. When a bad drive is repaired, the SP 114a can be instructed to restore data on the repaired drive while the system is on-line.

When the SP 114a is used to attach thirty independent SCSI drives, no parity stripe is created and the client addresses each drive directly.

The SP 114a processes multiple messages (transactions, commands) at one time, up to 200 messages per second. The SP 114a does not initiate any messages after initial system configuration. The following SP 114a operations are defined:

[...]  [...] Configuration Data
03     Receive Configuration Data
05     Read and Write Sectors
[...]  [...]
[...]  [...] Local Buffer
09     Start/Stop A SCSI Drive
0C     Inquiry
[...]  [...]

The above transactions are described in detail in the above-identified application entitled MULTIPLE FACILITY OPERATING SYSTEM ARCHITECTURE. For an understanding of the invention, it will be useful to describe the function and operation of only two of these commands: read and write sectors, and read and write cache pages.

Read and Write Sectors

This command, issued usually by an FC 112, causes the SP 114a to transfer data between a specified block of system memory and a specified series of contiguous sectors on the SCSI disks. As previously described in connection with the file controller 112, the particular sectors are identified in physical terms. In particular, the particular disk sectors are identified by SCSI channel number (0-9), SCSI ID on that channel number (0-2), starting sector address on the specified drive, and a count of the number of sectors to read or write. The SCSI channel number is zero if the SP 114a is operating under RAID 5.

The SP 114a can execute up to 30 messages on the 30 SCSI drives simultaneously. Unlike most of the commands to an SP 114, which are processed by microprocessor 510 as soon as they appear on the command FIFO 534, read and write sectors commands (as well as read and write cache memory commands) are first sorted and queued. Hence, they are not served in the order of arrival.

When a disk access command arrives, the microprocessor 510 determines which disk drive is targeted and inserts the message in a queue for that disk drive sorted by the target sector address. The microprocessor 510 executes commands on all the queues simultaneously, in the order present in the queue for each disk drive. In order to minimize disk arm movements, the microprocessor 510 moves back and forth among queue entries in an elevator fashion.

If no error conditions are detected from the SCSI disk drives, the command is completed normally. When a data check error condition occurs and the SP 114a is configured for RAID 5, recovery actions using redundant data begin automatically. When a drive is down and the SP 114a is configured for RAID 5, recovery actions similar to data check recovery take place.

Read/Write Cache Pages

This command is similar to read and write sectors, except that multiple VME addresses are provided for transferring disk data to and from system memory 116. Each VME address points to a cache page in system memory 116, the size of which is also specified in the command. When transferring data from a disk to system memory 116, data are scattered to different cache pages; when writing data to a disk, data are gathered from different cache pages in system memory 116. Hence, this operation is referred to as a scatter-gather function.

The target sectors on the SCSI disks are specified in the command in physical terms, in the same manner that they are specified for the read and write sectors command. Termination of the command with or without error conditions is the same as for the read and write sectors command.

The dual-port RAM 584 in the DMA controller 580 maintains a separate set of commands for each channel controlled by the bit slice engine 582. As each channel completes its previous operation, the microprocessor 510 writes a new DMA operation into the dual-port RAM 584 for that channel in order to satisfy the next operation on a disk elevator queue.

The commands written to the DMA controller 580 include an operation code and a code indicating whether the operation is to be performed in non-block mode, in standard VME block mode, or in enhanced block mode. The operation codes supported by DMA controller 580 are as follows:



OP CODE  OPERATION

3        ZEROES -> VMEbus
4        VMEbus -> BUFFER
         Move data from the VME [...]
[...]    VMEbus [...]
         [...] from VME bus onto a drive. Since RAID 5 requires redundancy data to be generated from data [...] operation will be us[...]
[...]    VMEbus -> BUFFER & FIFO
         [...] captured in the local data buffer 564 [...] participation in redundancy generation. Used only if SP 114a is configured for RAID 5.
[...]    BUFFER -> VMEbus
         This operation is not used.
[...]    BUFFER -> FIFO
         [...] drive. Used only in RAID 5 applications.
[...]    FIFO -> VMEbus
         This operation is used to move target data directly from a disk drive onto the VMEbus 120.
[...]    FIFO -> BUFFER
         Used to move participating data for [...] RAID 5 applications.
[...]    FIFO -> VMEbus & BUFFER
         This operation is used to save target data for [...]
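
The redundancy generation and recovery that several of these operations serve rest on the exclusive-or relationship stated earlier for RAID 5: the parity stripe is the exclusive-or of the eight data stripes. The following Python sketch illustrates that arithmetic only. It is not the SP 114a microcode or the parity FIFO logic, the function names are hypothetical, and the modify-write shortcut shown (new parity = old parity XOR old data XOR new data) is the standard identity, assumed here rather than taken from the text.

```python
from functools import reduce
from operator import xor


def parity_stripe(stripes):
    """Return the byte-wise exclusive-or of equal-length stripes."""
    if len({len(s) for s in stripes}) != 1:
        raise ValueError("stripes must be non-empty and of equal length")
    return bytes(reduce(xor, column) for column in zip(*stripes))


def reconstruct(surviving_stripes, parity):
    """Rebuild the one missing data stripe from the survivors plus parity."""
    return parity_stripe(list(surviving_stripes) + [parity])


def update_parity(old_parity, old_data, new_data):
    """Modify write: fold the change to one data stripe into the parity."""
    return parity_stripe([old_parity, old_data, new_data])
```

Because exclusive-or is its own inverse, the same routine serves for generating parity, for recovering a stripe from a down drive, and for the read-back step that makes a write a "modify write".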



SYSTEM MEMORY 
FIG. 6 provides a simplified block diagram of the 
preferred architecture of one of the system memory 
cards 116a. Each of the other system memory cards is
the same. Each memory card 116 operates as a slave on
the enhanced VME bus 120 and therefore requires no 
on-board CPU. Rather, a timing control block 610 is 
sufficient to provide the necessary slave control opera- 
tions. In particular, the timing control block 610, in 
response to control signals from the control portion of 
the enhanced VME bus 120, enables a 32-bit wide buffer 
612 for an appropriate direction transfer of 32-bit data 
between the enhanced VME bus 120 and a multiplexer 
unit 614. The multiplexer 614 provides a multiplexing 
and demultiplexing function, depending on data transfer 
direction, for a six megabit by seventy-two bit word 
memory array 620. An error correction code (ECC) 
generation and testing unit 622 is also connected to the 
multiplexer 614 to generate or verify, again depending 
on transfer direction, eight bits of ECC data. The status
of ECC verification is provided back to the timing
control block 610.

ENHANCED VME BUS PROTOCOL 
VME bus 120 is physically the same as an ordinary 
VME bus, but each of the NCs and SPs include addi- 
tional circuitry and firmware for transmitting data using 
an enhanced VME block transfer protocol. The en- 
hanced protocol is described in detail in the above-iden- 
tified application entitled ENHANCED VMEBUS 
PROTOCOL UTILIZING PSEUDOSYNCHRO- 
NOUS HANDSHAKING AND BLOCK MODE 
DATA TRANSFER, and summarized in the Appendix 
hereto. Typically transfers of LNFS file data between
NCs and system memory, or between SPs and system
memory, and transfers of packets being routed from one
NC to another through system memory, are the only
types of transfers that use the enhanced protocol in
server 100. All other data transfers on VMEbus 120 use
either conventional VME block transfer protocols or
ordinary non-block transfer protocols.

MESSAGE PASSING

As is evident from the above description, the different processors in the server 100 communicate with each other via certain types of messages. In software, these messages are all handled by the messaging kernel, described in detail in the MULTIPLE FACILITY OPERATING SYSTEM ARCHITECTURE application cited above. In hardware, they are implemented as follows.

Each of the NCs 110, each of the FCs 112, and each of the SPs 114 includes a command or communication FIFO such as 290 on NC 110a. The host 118 also includes a command FIFO, but since the host is an unmodified purchased processor board, the FIFO is emulated in software. The write port of the command FIFO in each of the processors is directly addressable from any of the other processors over VME bus 120.
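
The behaviour attributed to a command FIFO can be modelled in a few lines; the sketch below is an illustration in Python, not the hardware or the host's software emulation. It borrows the properties the description gives for the SP's command FIFO 534 (a depth of 256 32-bit entries, a bus error when a write finds the FIFO full, removal in the order received, and a "command available" status while any entry remains) and assumes, for illustration, that the same properties hold for the others. The class and exception names are hypothetical.

```python
from collections import deque


class VMEBusError(Exception):
    """Stands in for the VME bus error raised on a write to a full FIFO."""


class CommandFifo:
    def __init__(self, depth=256):
        self.depth = depth        # 256 entries is the depth given for FIFO 534
        self._entries = deque()

    def write(self, descriptor):
        """Write port: used by other processors over the VME bus."""
        if len(self._entries) >= self.depth:
            raise VMEBusError("command FIFO full")
        self._entries.append(descriptor & 0xFFFFFFFF)  # entries are 32 bits

    def read(self):
        """Read port: used only by the owning processor, in arrival order."""
        return self._entries.popleft()

    @property
    def command_available(self):
        """Asserted while one or more entries remain in the FIFO."""
        return bool(self._entries)
```

Keeping the write side and the read side as separate methods mirrors the description of a register that is write-only from the bus and read-only from the local microprocessor.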

Similarly, each of the processors except SPs 114 also includes shared memory such as CPU memory 214 on NC 110a. This shared memory is also directly addressable by any of the other processors in the server 100.

If one processor, for example network controller 110a, is to send a message or command to a second processor, for example file controller 112a, then it does so as follows. First, it forms the message in its own shared memory (e.g., in CPU memory 214 on NC 110a). Second, the microprocessor in the sending processor directly writes a message descriptor into the command FIFO in the receiving processor. For a command being sent from network controller 110a to file controller 112a, the microprocessor 210 would perform the write via buffer 284 on NC 110a, VME bus 120, and buffer 384 on file controller 112a.

The command descriptor is a single 32-bit word containing in its high order 30 bits a VME address indicating the start of a quad-aligned message in the sender's shared memory. The low order two bits indicate the message type as follows:

[...] also message acknowledgment

All messages are 128-bytes long.

When the receiving processor reaches the command descriptor on its command FIFO, it directly accesses the sender's shared memory and copies it into the receiver's own local memory. For a command issued from network controller 110a to file controller 112a, this would be an ordinary VME block or non-block mode transfer from NC CPU memory 214, via buffer 284, VME bus 120 and buffer 384, into FC CPU memory 314. The FC microprocessor 310 directly accesses NC CPU memory 214 for this purpose over the VME bus 120.

When the receiving processor has received the command and has completed its work, it sends a reply message back to the sending processor. The reply message may be no more than the original command message unaltered, or it may be a modified version of that message or a completely new message. If the reply message is not identical to the original command message, then the receiving processor directly accesses the original sender's shared memory to modify the original command message or overwrite it completely. For replies from the FC 112a to the NC 110a, this involves an ordinary VME block or non-block mode transfer from the FC 112a, via buffer 384, VME bus 120, buffer 284 and into NC CPU memory 214. Again, the FC microprocessor 310 directly accesses NC CPU memory 214 for this [...]

As mentioned above, network controller 110a uses the buffer 284 data path in order to write message descriptors onto the VME bus 120, and uses VME/FIFO DMA controller 272 together with parity FIFO 270 in order to copy messages from the VME bus 120 into CPU memory 214. Other processors read from CPU memory 214 using the buffer 284 data path.

File controller 112a writes message descriptors onto the VME bus 120 using the buffer 384 data path, and copies messages from other processors' shared memory via the same data path. Both take place under the control of microprocessor 310. Other processors copy messages from CPU memory 314 also via the buffer 384 data path.

Storage processor 114a writes message descriptors onto the VME bus using high-speed register 590 in the manner described above, and copies messages from other processors using DMA controller 580 and FIFO [...] shared memory, however, so it [...] system memory 116 to emulate that function. That is, before it writes a message descriptor into another processor's command FIFO, the SP 114a copies the message into its own previously allocated buffer in system memory 116 using DMA controller 580 and FIFO 554. The VME address included in the message descriptor then reflects the VME address of the message in system memory 116.

[...] command FIFO and shared memory [...] emulated in software.

The invention has been described with respect to particular embodiments thereof, and it will be understood that numerous modifications and variations are possible within the scope of the invention.

APPENDIX A

VME/FIFO DMA Controller

In storage processor 114a, DMA controller 580 manages the data path under the direction of the microprocessor 510. The DMA controller 580 is a microcoded 16-bit bit-slice implementation executing pipelined instructions at a rate of one each 62.5 ns. It is responsible for scanning the channel status 562 and servicing request with parameters stored in the dual-ported ram 584 by the microprocessor 510. Ending status is returned in the ram 584 and interrupts are gen-
purpose over the VME bus 120. erated for the microprocessor 510. 
Whether or not the original command message has Control Store 
- been changed, the receiving processor then writes a The control store contains the microcoded inslruc- 
reply message descriptor directly into the original send- jo lions which control the DMA controller 580. The con- 
er'& command FIFO. The reply message descriptor trol storeconsists of 6 1 Kx8 proms configured toyield 
contains the same VME address as the original com- a 1 Kx48 bit microword. Locations within the control 
mand message descriptor, and the low order two bits of store are addressed by the sequencer and data is pres- 
the word are modified to indicate that this is a reply ented at the input of the pipeline registers, 
message. For replies from the FC 112a to the NC 110a, 35 Sequencer 

the message descriptor write is accomplished by micro- The sequencer controls program fiow by generating 
processor 310 directly accessing command FIFO 290 control store addresses based upon pipeline data and 
via buffer 384, VME bus 120 and buffer 280 on the NC. various status bits. The control store address consists of 
Once this is done, the receiving processor can free the 10 bits. Bits 8:0 of the control store address derive from 
buffer in its local memory containing the copy of the 60 a multiplexer having as its inputs either an ALU output 
command message. or the output of an incrementer. The incrementer can be 

When the original sending processor reaches the preloaded with pipeline register bits 8:0, or it can be 
reply message descriptor on its command FIFO, it incremented as a result of a test condition. The 1 K 
wakes up the process that originally sent the message address range is divided into two pages by a latched flag 
and permits it to continue. After examining the reply 65 such that the microprogram can execute from either 
message, the original sending processor can free the page. Branches, however remain within the selected 
original command message buffer in its own local page. Conditional sequencing is performed by having 
shared memory. the test condition increment the pipeline provided ad- 
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dress. A false condition allows execution from the pipe- 
line address while a true condition causes execution 
from the address + 1. The alu output is selected as an 
address source in order to directly vector to a routine or 
in order to return to a calling routine. Note that when 5 ^ 

calling a subroutine the calhng routine must reside 

within the same page as the subroutine or the wrong 
page will be selected on the return. 
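
The conditional sequencing rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the microcode itself: the function names are hypothetical, placing the page flag in bit 9 of the 10-bit address is an assumption drawn from the statement that bits 8:0 come from the multiplexer, and the wrap-around of the 9-bit offset is also assumed.

```python
PAGE_SHIFT = 9        # assumed: bit 9 of the 10-bit control store address is the page
OFFSET_MASK = 0x1FF   # bits 8:0 come from the incrementer/ALU multiplexer


def next_address(page: int, pipeline_addr: int, test_true: bool) -> int:
    """Conditional sequencing: a true test condition bumps the pipeline address by one."""
    offset = (pipeline_addr + (1 if test_true else 0)) & OFFSET_MASK  # wrap is assumed
    return ((page & 1) << PAGE_SHIFT) | offset


def vector_address(page: int, alu_output: int) -> int:
    """Direct vector or subroutine return: bits 8:0 are taken from the ALU output."""
    return ((page & 1) << PAGE_SHIFT) | (alu_output & OFFSET_MASK)
```

Because only bits 8:0 are supplied by the pipeline or the ALU, a return address computed this way always lands in whichever page is currently latched, which is why caller and subroutine must share a page.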

ALU

The alu comprises a single IDT49C402A integrated circuit. It is 16 bits in width and most closely resembles four 2901s with 64 registers. The alu is used primarily for incrementing, decrementing, addition and bit manipulation. All necessary control signals originate in the control store. The IDT HIGH PERFORMANCE CMOS 1988 DATA BOOK, incorporated by reference herein, contains additional information about the alu.

Microword

The 48 bit microword comprises several fields which control various functions of the DMA controller 580. The format of the microword is defined below along with mnemonics and a description of each function.

47:39 (Alu Instruction bits 8:0) The AI bits provide the instruction for the 49C402A alu. Refer to the IDT data book for a complete definition of the alu [...]

37:32 (Register A address bits 5:0) These bi[ts ...] operand for the alu. These bits also [...]

20,21 (alu bus SouRCe select bits 1,0) These bits select the data source to be enabled onto the alu bus.

[...] of buffer clock.
SET_VB [...]
SET_DN set channel done status for the currently selected [...]
RESERVED - NOT DEFINED

[...] (Latched Flag Data) W[hen set this bit] causes the selected latch[ed flag to be] set. When reset this bit [causes the] selected latched flag to be cleared. This bit also functions as literal bit 3 for the alu bus.

[...] (Latched Flag Select bits 2:0) The meaning of these bits is dependent upon the selected source for the alu bus. In the event that the literal field is [selected,] LFS<2:0> function as literal bits <2:0>; otherwise the bits are used to select one of the latched flags.



SELECTED FLAG

[...] This value selects a null flag.

[...] When set this bit enables the buffer clock. When reset this bit disables the buffer clock.

[...] When this bit is cleared VME bus transfers, buffer [operatio]ns and RAS are all [...] bus tr[...] buffer operations.

[...] When set this bit asserts the row address strobe to the dram buffer.

[...] When set this bit selects page 0 of the control store.



WR_RAM
causes the data on the alu bus to be written to the dual [ported ram].
D<15:0> -> ram<15:0>

WR_BADD
loads the data from the alu bus into the dram address [...]

[...]
[loads the least significant] 2 bytes of the VME address.
D<15:2> -> VMEaddr<15:2>
D1 -> ENB_ENH
D0 -> ENB_BLK

WR_VADH
loads the most significant 2 bytes of the VME address [...]

[...]
D15 -> coun[...]
D<[...]> -> [...]

WR_CO
[...]
D<7:4> -> CO<3:0>

WR_NXT
loads the next-channel select [register].
D<3:0> -> NEXT<3:0>

WR_CUR
loads the current-channel select register.
D<3:0> -> CURR<3:0>

RESERVED - NOT DEFINED

JUMP
causes the control store sequencer to select the alu [...]
D<8:0> -> CS_A<8:0>

12:9 (TEST condition select bits 3:0) Select one of 16 inputs to the test multiplexor to be used as the carry input to the incrementer.

FALSE
TRUE
ALU_COUT (carry output of alu)
ALU_EQ
ALU_OVR
ALU_NEG
XFR_DONE
PAR_ERR
TIMOUT
ANY_ERR
RESERVED
CH_RDY
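
As a rough illustration of how the fields listed above sit inside the 48 bit microword, the following Python sketch extracts the three fields whose bit positions are legible in the text (47:39, 37:32 and 12:9). The short names AI, RA and TEST and the helper functions are hypothetical labels chosen for this example; the remaining fields are left out because their positions cannot be read reliably.

```python
MICROWORD_BITS = 48

# (high bit, low bit) for the fields whose positions are stated in the text.
# The keys are illustrative labels, not mnemonics confirmed by the source.
FIELDS = {
    "AI": (47, 39),    # alu instruction bits 8:0
    "RA": (37, 32),    # register A address bits 5:0
    "TEST": (12, 9),   # test condition select bits 3:0
}


def extract(microword: int, high: int, low: int) -> int:
    """Return the bit field microword[high:low] as an unsigned integer."""
    width = high - low + 1
    return (microword >> low) & ((1 << width) - 1)


def decode(microword: int) -> dict:
    """Split a 48 bit microword into the fields named in FIELDS."""
    if not 0 <= microword < (1 << MICROWORD_BITS):
        raise ValueError("microword must fit in 48 bits")
    return {name: extract(microword, hi, lo) for name, (hi, lo) in FIELDS.items()}
```

For example, `decode((0x1AB << 39) | (0x2A << 32) | (0xC << 9))` yields an AI value of 0x1AB, an RA value of 0x2A and a TEST selector of 0xC.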



NEXT_A<8:0>

Dual Ported Ram

The dual ported ram is the medium by which command, parameters and status are communicated between the DMA controller 580 and the microprocessor 510. The ram is organized as 1K x 32 at the master port and as 2K x 16 at the DMA port. The ram may be both written and read at either port.

The ram is addressed by the DMA controller 580 by loading an 11 bit address into the address counters. Data is then read into bidirectional registers and the address counter is incremented to allow read of the next location.

Writing the ram is accomplished by loading data from the processor into the registers after loading the ram address. Successive writes may be performed on every other processor cycle.

The ram contains current block pointers, ending status, high speed bus address and parameter blocks. The following is the format of the ram:

[ram format table not recoverable from the source; surviving labels: OFFSET, PARAMETER BLOCK n]

The Initial Pointer is a 32 bit value which points to the first command block of a chain. The current pointer is a sixteen bit value used by the DMA controller 580 to point to the current command block. The current command block pointer should be initialized to 0x0000 by the microprocessor 510 before enabling the channel. Upon detecting a value of 0x0000 in the current block pointer the DMA controller 580 will copy the lower 16 bits from the initial pointer to the current pointer. Once the DMA controller 580 has completed the specified operations for the parameter block the current pointer will be updated to point to the next block. In the event that no further parameter blocks are available the pointer will be set to 0x0000.

The status byte indicates the ending status for the last channel operation performed. The following status bytes are defined:

[status values illegible]
ILLEGAL OP CODE
BUS OPERATION TIMEOUT
BUS OPERATION ERROR
DATA PATH PARITY ERROR

The format of the parameter block is:

[parameter block layout not recoverable from the source; surviving labels: TERM n, OP n, BUF ADDR n]

FORWARD LINK

The forward link points to the next parameter block for execution. It allows several parameter blocks to be initialized and chained to create a sequence of operations for execution. The forward pointer has the following format:

A31:A2,0,0

This format dictates that the parameter block must start on a quad byte boundary. A pointer of 0x00000000 is a special case which indicates no forward link exists.

WORD COUNT

The word count specifies the number of quad byte words that are to be transferred to or from each buffer address or to/from the VME address. A word count of 64K words may be specified by initializing the word count with the value of 0. The word count has the following format:

[word count format not recoverable from the source]

The word count is updated by the DMA controller 580 at the completion of a transfer to/from the last specified buffer address. Word count is not updated after transferring to/from each buffer address and is therefore not an accurate indicator of the total data moved to/from the buffer. Word count represents the amount of data transferred to the VME bus or one of the FIFOs 544 or 554.

VME ADDRESS

The VME address specifies the starting address for data transfers. Thirty bits allows the address to start at any quad byte boundary.

ENH

This bit when set selects the enhanced block transfer protocol described in the above-cited ENHANCED VMEBUS PROTOCOL UTILIZING PSEUDOSYNCHRONOUS HANDSHAKING AND BLOCK MODE DATA TRANSFER application, to be used during the VME bus transfer. Enhanced protocol will be disabled automatically when performing any transfer to or from 24 bit or 16 bit address space, when the starting address is not 8 byte aligned or when the word count is not even.

BLK

This bit when set selects the conventional VME block mode protocol to be used during the VME bus transfer. Block mode will be disabled automatically when performing any transfer to or from 16 bit address space.

BUF ADDR

The buffer address specifies the starting buffer address for the adjacent operation. Only 16 bits are available for a 1M byte buffer and as a result the starting address always falls on a 16 byte boundary. The programmer must ensure that the starting address is on a modulo 128 byte boundary. The buffer address is updated by the DMA controller 580 after completion of each data burst.

[buffer address format table illegible in the source]

[TERM]

[...] operations to perform until this bit is encountered. [...] the last operation within the parameter block is executed the word counter is updated and if not equal to zero the series of operations is repeated. [...] counter reaches zero the forward link pointer is used to access the next parameter block.

|0|0|0|0|0|0|0|T|

OP

Operations are specified by the op code. The op code byte has the following format:

|0|0|0|0|OP3|OP2|OP1|OP0|

The op codes are listed below ("FIFO" refers to any of the FIFOs 544 or 554):

OP CODE  OPERATION
0        NO-OP
[illegible row or rows, ending in "FIFO"]
3        ZEROES -> VMEbus
4        VMEbus -> BUFFER
5        VMEbus -> FIFO
[illegible row or rows, one ending in "FIFO"]
[?]      FIFO -> VMEbus
A        FIFO -> BUFFER
B        FIFO -> VMEbus & BUFFER
[row illegible]
D        RESERVED
E        RESERVED
F        RESERVED

APPENDIX B

Enhanced VME Block Transfer Protocol

The enhanced VME block transfer protocol is a VMEbus compatible pseudo-synchronous fast transfer handshake protocol for use on a VME backplane bus having a master functional module and a slave functional module logically interconnected by a data transfer [bus]. The data transfer bus includes a data strobe signal line and a data transfer acknowledge signal line. To accomplish the handshake, the master transmits a data strobe signal of a given duration on the data strobe [line]. The master then awaits the reception of a data transfer acknowledge signal from the slave module on the data transfer acknowledge signal line. The slave then responds by transmitting a data transfer acknowledge signal of a given duration on the data transfer acknowledge signal line.

Consistent with the pseudo-synchronous nature of the handshake protocol, the data to be transferred is referenced to only one signal depending upon whether the transfer operation is a READ or WRITE operation. In transferring data from the master functional unit to the slave, the master broadcasts the data to be transferred. The master asserts a data strobe signal and the slave, in response to the data strobe signal, captures the data broadcast by the master. Similarly, in transferring data [from the slave to the master, the slave] broadcasts the data to be transferred to the master unit. The slave [...]

[...] facilitates the rapid transfer of large [...] backplane bus by sub[...] transfer rate. These data rates are achieved by using a handshake wherein the [...] acknowledge signals are functionally decoupled and by specifying high current drivers for all data and control lines.

The enhanced pseudo-synchronous method of data transfer (hereinafter referred to as "fast transfer mode") is implemented so as to comply and be compatible with the IEEE VME backplane bus standard. The protocol utilizes user-defined address modifiers, defined in the VMEbus standard, to indicate use of the fast transfer mode. Conventional VMEbus functional units, capable only of implementing standard VMEbus protocols, will ignore transfers made using the fast transfer mode and, as a result, are fully compatible with functional units capable of implementing the fast transfer mode.

The fast transfer mode reduces the number of bus propagations required to accomplish a handshake from four propagations, as required under conventional VMEbus protocols, to only two bus propagations. Likewise, the number of bus propagations required to effect a BLOCK READ or BLOCK WRITE data transfer is reduced. Consequently, by reducing the propagations across the VMEbus to accomplish handshaking and data transfer functions, the transfer rate is materially increased.

The enhanced protocol is described in detail in the above-cited ENHANCED VMEBUS PROTOCOL application, and will only be summarized here. Familiarity with the conventional VME bus standards is assumed.
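
The effect of cutting a handshake from four bus propagations to two can be illustrated with a small back-of-the-envelope model in Python. This is a sketch under stated assumptions, not a timing specification: it counts only bus propagation delay, the 25 ns one-way delay used in the usage note is an arbitrary illustrative figure, and the default of four bytes per handshake simply reflects the 32 data lines D00 through D31.

```python
CONVENTIONAL_PROPAGATIONS = 4   # conventional VMEbus handshake, per the text
FAST_PROPAGATIONS = 2           # fast transfer mode handshake, per the text


def handshake_limited_rate(propagations: int,
                           one_way_delay_ns: float,
                           bytes_per_transfer: int = 4) -> float:
    """Upper bound on bytes per second when only propagation delay is counted."""
    handshake_ns = propagations * one_way_delay_ns
    return bytes_per_transfer * 1e9 / handshake_ns


def speedup(one_way_delay_ns: float) -> float:
    """Ratio of the fast-mode bound to the conventional bound."""
    fast = handshake_limited_rate(FAST_PROPAGATIONS, one_way_delay_ns)
    conventional = handshake_limited_rate(CONVENTIONAL_PROPAGATIONS, one_way_delay_ns)
    return fast / conventional
```

With an assumed 25 ns one-way delay, `handshake_limited_rate(4, 25.0)` and `handshake_limited_rate(2, 25.0)` differ by a factor of two, and `speedup` returns that same factor for any delay, since the model scales linearly with the number of propagations.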
In the fast transfer mode handshake protocol, only two bus propagations are used to accomplish a handshake, rather than four as required by the conventional protocol. At the initiation of a data transfer cycle, the master will assert and deassert DS0* in the form of a pulse of a given duration. The deassertion of DS0* is accomplished without regard as to whether a response has been received from the slave. The master then waits for an acknowledgement from the slave. Subsequent [...] DTACK* signal is received from the slave. Upon receiving the slave's assertion of DTACK*, the master [...] of DS0*. In the fast transfer mode, only the leading edge (i.e., the assertion) of a signal is significant. Thus, the deassertion of either DS0* or DTACK* is completely irrelevant for completion of a handshake. The fast transfer protocol does not employ the DS1* line for data strobe purposes at all.

The fast transfer mode protocol may be characterized as pseudo-synchronous as it includes both synchronous and asynchronous aspects. The fast transfer mode protocol is synchronous in character due to the fact that DS0* is asserted and deasserted without regard to a response from the slave. The asynchronous aspect of the fast transfer mode protocol is attributable to the fact that the master may not subsequently assert DS0* until a response to the prior strobe is received from the slave. Consequently, because the protocol includes both synchronous and asynchronous components, it is most accurately classified as "pseudo-synchronous."

The transfer of data during a BLOCK WRITE cycle in the fast transfer protocol is referenced only to DS0*. The master first broadcasts valid data to the slave, and then asserts DS0* to the slave. The slave is given a predetermined period of time after the assertion of DS0* in which to capture the data. Hence, slave modules must be prepared to capture data at any time, as DTACK* is not referenced during the transfer cycle.

Similarly, the transfer of data during a BLOCK READ cycle in the fast transfer protocol is referenced only to DTACK*. The master first asserts DS0*. The slave then broadcasts data to the master and then asserts DTACK*. The master is given a predetermined period of time after the assertion of DTACK* in which to capture the data. Hence, master modules must be prepared to capture data at any time as DS0* is not referenced during the transfer cycle.

FIG. 7, parts A through C, is a flowchart illustrating the operations involved in accomplishing the fast transfer protocol BLOCK WRITE cycle. To initiate a BLOCK WRITE cycle, the master broadcasts the memory address of the data to be transferred and the address modifier across the DTB bus. The master also drives interrupt acknowledge signal (IACK*) high and the LWORD* signal low 701. A special address modifier, for example "1F," broadcast by the master indicates to the slave module that the fast transfer protocol will be used to accomplish the BLOCK WRITE.

The starting memory address of the data to be transferred should reside on a 64-bit boundary and the size of block of data to be transferred should be a multiple of 64 bits. In order to remain in compliance with the VMEbus standard, the block must not cross a 256 byte boundary without performing a new address cycle.

The slave modules connected to the DTB receive the address and the address modifier broadcast by the master across the bus and receive LWORD* low and IACK* high 703. Shortly after broadcasting the address and address modifier 701, the master drives the AS* signal low 705. The slave modules receive the AS* low signal 707. Each slave individually determines whether it will participate in the data transfer by determining whether the broadcasted address is valid for the slave in question 709. If the address is not valid, the data transfer does not involve that particular slave and it ignores the remainder of the data transfer cycle.

[The master] drives WRITE* low to indicate that the [transfer cycle about] to [occur] is a WRITE operation [...]. [...] WRITE operation, awaits [...] DTACK* and BERR* are high 718, which [...] DTB.

[The master proceeds] to place the first segment of the data to be transferred on data lines D00 through D31. [After placing the data on D00 through D31, the master drives DS0* low and, after] a predetermined interval, [drives DS0* high].

In response to the transition of DS0* from high to low, respectively 721 and 723, the slave latches the data being transmitted by the master over data lines D00 through D31, 725. The master places the next segment of the data to be transferred on data lines D00 through D31, 727, and awaits receipt of a DTACK* signal in the [form of a] high to low transition signal, 729 in FIG. 7B.

Referring to FIG. 7B, the slave then drives DTACK* low, 731, and, after a predetermined period of time, drives DTACK* high, 733. The data latched by the slave, 725, is written to a device, which has been selected to store the data 735. The slave also increments the device address 735. The slave then waits for another transition of DS0* from high to low 737.

To commence the transfer of the next segment of the block of data to be transferred, the master drives DS0* low [739 and,] after a predetermined period of time, drives DS0* high 741. In response to the transition of DS0* from high to low, respectively 739 and 741, the slave latches the data being broadcast by the master over data lines D00 through D31, 743. The master places the next segment of the data to be transferred on data lines D00 through D31, 745, and awaits receipt of a DTACK* signal in the form of a high to low transition, 747.

The slave then drives DTACK* low, 749, and, after a predetermined period of time, drives DTACK* high, 751. The data latched by the slave, 743, is written to the device selected to store the data and the device address is incremented 753. The slave waits for another transition of DS0* from high to low 737.

The transfer of data will continue in the above-described manner until all of the data has been transferred from the master to the slave. After all of the data has been transferred, the master will release the address lines, address modifier lines, data lines, IACK* line, LWORD* line and DS0* line, 755. The master will then wait for receipt of a DTACK* high to low transition 757. The slave will drive DTACK* low, 759 and, after a predetermined period of time, drive DTACK* high 761. In response to the receipt of the DTACK*
high to low transition, the master will drive AS* high 763 and then release the AS* line 765.

FIG. 8, parts A through C, is a flowchart illustrating the operations involved in accomplishing the fast transfer protocol BLOCK READ cycle. To initiate a BLOCK READ cycle, the master broadcasts the memory address of the data to be transferred and the address modifier across the DTB bus 801. The master drives the LWORD* signal low and the IACK* signal high 801. As noted previously, a special address modifier indicates to the slave module that the fast transfer protocol will be used to accomplish the BLOCK READ.

The slave modules connected to the DTB receive the address and the address modifier broadcast by the master across the bus and receive LWORD* low and IACK* high 803. Shortly after broadcasting the address and address modifier 801, the master drives the AS* signal low 805. The slave modules receive the AS* low signal 807. Each slave individually determines whether it will participate in the data transfer by determining whether the broadcasted address is valid for the slave in question 809. If the address is not valid, the data transfer does not involve that particular slave and it ignores the remainder of the data transfer cycle.

The master drives WRITE* high to indicate that the transfer cycle about to occur is a READ operation 811. The slave receives the WRITE* high signal 813 and, knowing that the data transfer operation is a READ operation, places the first segment of the data to be transferred on data lines D00 through D31 819. The master will wait until both DTACK* and BERR* are high 818, which indicates that the previous slave is no longer driving the DTB.

The master then drives DS0* low [...] and, after a predetermined [...] signal high 827.

In response to [the transition of] DTACK* from high [to low, the master] latches [...] is written to a device, which has been selected to store the data [and] the device address is incremented 833. The slave places the next segment of the data to be transferred on data lines D00 through D31, 829, and then waits for another transition of DS0* from high to [low].

To commence the transfer of the next segment of the block of data to be transferred, the master drives DS0* low 839 and, after a predetermined period of time, drives DS0* high 841. The master then waits for the DTACK* line to transition from high to low, 843.

The slave drives DTACK* low, 845, and, after a predetermined period of time, drives DTACK* high, 847. In response to the transition of DTACK* from high to low, respectively 839 and 841, the master latches the data being transmitted by the slave over data lines D00 through D31, 845. The data latched by the master 845 is written to the device selected to store the data 851 in FIG. 8C, and the device address is incremented. The slave places the next segment of the data to be transferred on data lines D00 through D31, 849.

The transfer of data will continue in the above-described manner until all of the data to be transferred from the slave to the master has been written into the device selected to store the data. After all of the data to be transferred has been written into the storage device, the master will release the address lines, address modifier lines, data lines, the IACK* line, the LWORD* line and DS0* line 852. The master will then wait for receipt of a DTACK* high to low transition 853. The slave will drive DTACK* low 855 and, after a predetermined period of time, drive DTACK* high 857. In response to the receipt of the DTACK* high to low transition, the master will drive AS* high 859 and release the AS* line 861.

To implement the fast transfer protocol, a conventional 64 mA tri-state driver is substituted for the 48 mA open collector driver conventionally used in VME slave modules to drive DTACK*. Similarly, the conventional VMEbus data drivers are replaced with 64 mA tri-state drivers in SO-type packages. The latter modification reduces the ground lead inductance of the actual driver package itself and, thus, reduces "ground bounce" effects which contribute to skew between data, DS0* and DTACK*. In addition, signal return inductance along the bus backplane is reduced by using a connector system having a greater number of ground pins so as to minimize signal return and mated-pair pin inductance. One such connector system is the "High Density Plus" connector, Model No. 420-8015-000, manufactured by Teradyne Corporation.

APPENDIX C

Parity FIFO

[The parity FIFOs ...] and 544 and 554 (on storage processors [...]) [are each] implemented as an ASIC. All the parity FIFOs are identical, and are configured on power-up or during normal operation for the particular function [...] SCSI drives.

The FIFO comprises two bidirectional data ports, [Port A] and Port B, with 36 x 64 bits of RAM buffer [...] 36 x 32 bits, designated RAM X and RAM Y. The two ports access different halves of the buffer [...] the other half when [...]. [When the chip is config]ured as a parallel parity chip (e.g. one of the FIFOs 544 on SP 114a), all accesses on Port B are [...]

The chip also has a CPU interface, which may be 8 or 16 bits wide. In 16 bit mode the Port A pins are used [as] the most significant data bits of the CPU interface and are only actually used when reading or writing to the Fifo Data Register inside the chip.

A REQ, ACK handshake is used for data transfer on both Ports A and B. The chip may be configured as either a master or a slave on Port A in the sense that, in master mode the Port A ACK / RDY output signifies that the chip is ready to transfer data on Port A, and the Port A REQ input specifies that the slave is responding. In slave mode, however, the Port A REQ input specifies that the master requires a data transfer, and the chip responds with Port A ACK / RDY when data is available. The chip is a master on Port B since it raises Port B REQ and waits for Port B ACK to indicate completion of the data transfer.
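
As a hedged illustration of what a chip configured as the parallel parity chip computes (the mode is described under Port B Parallel Write Mode and Port B Parallel Read Mode in this appendix), the Python sketch below accumulates byte-wise exclusive-OR parity over 128 byte blocks and checks a set of blocks against it. The function names are hypothetical, and the sketch models only the arithmetic, not the RAM X / RAM Y switching or the REQ/ACK signalling.

```python
BLOCK_SIZE = 128  # bytes accumulated per parity block, per the appendix


def accumulate_parity(blocks):
    """XOR equal-sized blocks together; the first block is effectively copied."""
    parity = bytearray(BLOCK_SIZE)  # all zeros, so the first XOR is a plain copy
    for block in blocks:
        if len(block) != BLOCK_SIZE:
            raise ValueError("every block must be exactly 128 bytes")
        for i, value in enumerate(block):
            parity[i] ^= value
    return bytes(parity)


def parity_ok(data_blocks, parity_block):
    """Check mode: XOR of all data blocks and the parity block must be zero."""
    combined = accumulate_parity(list(data_blocks) + [parity_block])
    return not any(combined)
```

Starting from an all-zero accumulator is equivalent to copying the first block and exclusive-ORing each later block into it, which is the behaviour the write mode describes; any nonzero bit left after the check corresponds to a flagged parallel parity error.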



SIGNAL DESCRIPTIONS 



5,163,131 


Port A 0-7, P

Port A is the 8 bit data port. Port A P, if used, is the odd parity bit for this port.

A Req, A Ack/Rdy

These two signals are used in the data transfer mode to control the handshake of data on Port A.

uP Data 0-7, uP Data P, uPAdd 0-2, CS

These signals are used by a microprocessor to address the programmable registers within the chip. The odd parity signal uP Data P is only checked when data is written to the Fifo Data or Checksum Registers and microprocessor parity is enabled.

Clk

The clock input is used to generate some of the chip timing. It is expected to be in the 10-20 MHz range.

Read En, Write En

During microprocessor accesses, while CS is true, these signals determine the direction of the microprocessor accesses. During data transfers in the WD mode these signals are data strobes used in conjunction with Port A Ack.

Port B 00-07, 10-17, 20-27, 30-37, 0P-3P

Port B is a 32 bit data port. There is one odd parity bit for each byte. Port B 0P is the parity of bits 00-07, Port B 1P is the parity of bits 10-17, Port B 2P is the parity of bits 20-27, and Port B 3P is the parity of bits 30-37.

B Select, B Req, B Ack, Parity Sync, B Output Enable

These signals are used in the data transfer mode to control the handshake of data on Port B. Port B Req and Port B Ack are both gated with Port B Select. The Port B Ack signal is used to strobe the data on the Port B data lines. The parity sync signal is used to indicate to a chip configured as the parity chip that the last words of data involved in the parity accumulation are on Port B. The Port B data lines will only be driven by the Fifo chip if all of the following conditions are [met]:

a. the data transfer is from Port A to Port B;

b. the Port B select signal is true;

c. the Port B output enable signal is true; and

d. the chip is not configured as the parity chip or it is in parity correct mode and the Parity Sync signal is true.

Reset

This signal resets all the registers within the chip and causes all bidirectional pins to be in a high impedance state.

DESCRIPTION OF OPERATION

Normal Operation

Normally the chip acts as a simple FIFO chip. A FIFO is simulated by using two RAM buffers in a simple ping-pong mode. It is intended, but not mandatory, that data is burst into or out of the FIFO on Port B. This [...]

[...]

This is the default mode and the chip is reset to this condition. In this mode the chip waits for a master such as one of the SCSI adapter chips 542 to raise Port A Request for data transfer. If data is available the Fifo chip will respond with Port A Ack/Rdy.

Port A WD Mode

The chip may be configured to run in the WD or Western Digital mode. In this mode the chip must be configured as a slave on Port A. It differs from the default slave mode in that the chip responds with Read Enable or Write Enable as appropriate together with Port A Ack/Rdy. This mode is intended to allow the chip to be interfaced to the Western Digital 33C93A SCSI chip or the NCR 53C90 SCSI chip.

Port A Master Mode

When the chip is configured as a master, it will raise Port A Ack/Rdy when it is ready for data transfer. This signal is expected to be tied to the Request input of a DMA controller which will respond with Port A Req when data is available. In order to allow the DMA controller to burst, the Port A Ack/Rdy signal will only be negated after every 8 or 16 bytes transferred.

Port B Parallel Write Mode

In parallel write mode, the chip is configured to be the parity chip for a parallel transfer from Port B to Port A. In this mode, when Port B Select and Port B Request are asserted, data is written into RAM X or RAM Y each time the Port B Ack signal is received. For the first block of 128 bytes data is simply copied into the selected RAM. The next 128 bytes driven on Port B will be exclusive-ORed with the first 128 bytes. This procedure will be repeated for all drives such that the parity is accumulated in this chip. The Parity Sync signal should be asserted to the parallel chip together with the last block of 128 bytes. This enables the chip to switch access to the other RAM and start accumulating a new 128 bytes of parity.

Port B Parallel Read Mode - Check Data

This mode is set if all drives are being read and parity is to be checked. In this case the Parity Correct bit in the Data Transfer Configuration Register is not set. The parity chip will first read 128 bytes on Port A as in a normal read mode and then raise Port B Request. While it has this signal asserted the chip will monitor the Port B Ack signals and exclusive-or the data on Port B with the data in its selected RAM. The Parity Sync should again be asserted with the last block of 128 bytes. In this mode the chip will not drive the Port B data lines but will check the output of its exclusive-or logic for zero. If any bits are set at this time a parallel parity error will be flagged.

Port B Parallel Read Mode - Correct Data

This mode is set by setting the Parity Correct bit in the Data Transfer Configuration Register. In this case the chip will work exactly as in the check mode except that when Port B Output Enable, Port B Select and Parity Sync are true the data is driven onto the Port B data lines and a parallel parity check for zero is not

is done by holding Port B Sel signal low and pulsing the performed. 

Port B Ack signal. When transferring data from Port B 60 Byte Swap 

to Port A, data is first written into RAM X and when In the normal mode it is expected that Port B bits 
this is full, the data paths will be switched such that Port 00-07 are the first byte, bits 10-17 the second byte, bits 
B may start writing to RAM Y. Meanwhile the chip 20-27 the third byte, and bits 30-37 the last byte of each 
will begin emptying RAM X to Port A. When RAM Y word. The order of these bytes may be changed by 
is full and RAM X empty the data paths will be 65 writing to the byte swap bits in the configuration regis- 
switched again such that Port B may reload RAM X ter such that the byte address bits are inverted. The way 
and Port A may empty RAM Y. the bytes are written and read also depend on whether 

Port A Slave Mode the CPU interface is configured as 16 or 8 bits. The 
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following table shows the byte alignments for the differ- 
ent possibilities for data transfer using the Port A Re- 
quest/Acknowledge handshake: 



20-27 30-37 






CPU Interface 16 bits wide. If set,
the microprocessor data bits are
combined with the Port A data bits to
effectively produce a 16 bit Port. A1



Port A Port A 



True True 



byte 1 byte 0 

byte 1 byte 1

uProc Port A 

byte 1 byte 1

Port A uProc 

byte 0 byte 0 

uProc Port A 

byte 0 byte 0 



Invert Port A byte address 0. Set to
invert the least significant bit of
Port A byte address.
Invert Port A byte address 1. Set to



When the Fifo is accessed by reading or writing the
Fifo Data Register through the microprocessor port in
8 bit mode, the bytes are in the same order as the table
above but the uProc data port is used instead of Port A.
In 16 bit mode the table above applies.
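
The byte swap described above amounts to inverting one or both bits of a two bit byte address. The following is a minimal sketch of that idea only, not the chip's logic: the mask values INVERT_ADDR_BIT0 and INVERT_ADDR_BIT1 and both function names are hypothetical and do not correspond to actual bit positions in the configuration register.

```c
#include <stdint.h>

/* Hypothetical mask values, chosen for illustration only. */
#define INVERT_ADDR_BIT0 0x1u   /* invert least significant byte address bit */
#define INVERT_ADDR_BIT1 0x2u   /* invert most significant byte address bit */

/* Returns the byte lane (0..3) of a 32 bit word that is selected
   for a given 2 bit byte address once the inversion is applied. */
static unsigned swapped_lane(unsigned byte_address, unsigned invert_mask)
{
    return (byte_address ^ invert_mask) & 0x3u;
}

/* Reorders the four bytes of one word according to the invert mask. */
static void reorder_word(const uint8_t in[4], uint8_t out[4],
                         unsigned invert_mask)
{
    unsigned addr;
    for (addr = 0; addr < 4; addr++)
        out[addr] = in[swapped_lane(addr, invert_mask)];
}
```

With both invert bits set the byte order of each word is reversed; with only bit 0 set the bytes are exchanged in pairs.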

Odd Length Transfers 

If the data transfer is not a multiple of 32 words, or
128 bytes, the microprocessor must manipulate the
internal registers of the chip to ensure all data is
transferred. Port A Ack and Port B Req are normally not
asserted until all 32 words of the selected RAM are
available. These signals may be forced by writing to the
appropriate RAM status bits of the Data Transfer Status
Register.

When an odd length transfer has taken place the 
microprocessor must wait until both ports are quiescent 
before manipulating any registers. It should then reset 
both of the Enable Data Transfer bits for Port A and 
Port B in the Data Transfer Control Register. It must 
then determine by reading their Address Registers and 
the RAM Access Control Register whether RAM X or 
RAM Y holds the odd length data. It should then set the 
corresponding Address Register to a value of 20 hexa- 
decimal, forcing the RAM full bit and setting the ad- 
dress to the first Word. Finally the microprocessor 
should set the Enable Data Transfer bits to allow the 
chip to complete the transfer. 

At this point the Fifo chip will think that there are
now a full 128 bytes of data in the RAM and will
transfer 128 bytes if allowed to do so. The fact that some of
these 128 bytes are not valid must be recognized
externally to the FIFO chip.
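
Because the chip always moves a full 128 bytes once the RAM full bit is forced, whatever sits outside the chip has to know how many of those bytes are real. A small arithmetic sketch follows, assuming the external logic knows the total transfer length in bytes; the names FIFO_RAM_BYTES, odd_bytes and filler_bytes are illustrative and do not appear in the original listing.

```c
#include <stddef.h>

#define FIFO_RAM_BYTES 128u   /* 32 words of 4 bytes, per the text */

/* Number of valid bytes in the final, partially filled RAM
   (0 when the transfer is a multiple of 128 bytes). */
static size_t odd_bytes(size_t transfer_length)
{
    return transfer_length % FIFO_RAM_BYTES;
}

/* Number of invalid filler bytes moved with the final block
   after the RAM full bit has been forced. */
static size_t filler_bytes(size_t transfer_length)
{
    size_t odd = odd_bytes(transfer_length);
    return odd ? FIFO_RAM_BYTES - odd : 0;
}
```

For a 300 byte transfer, for example, two full blocks are followed by a forced block carrying 44 valid bytes and 84 bytes that must be discarded externally.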

PROGRAMMABLE REGISTERS 

Data Transfer Configuration Register (Read/Write)

Register Address 0 

This register is cleared by the reset signal. 



A byte address.
Checksum Carry
carry out of the 1
to carry back int<
bit of the adder.

Reset. Writing a 1 to this bit will
reset the other registers. This bit
be read as a 0. No other re
should be written
clock cycles after writing to



WD Mode Selifi 
use the Western V , 
protocol, otherwise 



the Adaptec 6250 



Parity Chip. Set if this chip is to
accumulate Port B parities.
Parity Correct Mode. Set if the
parity chip is to correct parallel
parity on Port B.
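
The parity chip behaviour described for the parallel write, check and correct modes can be modelled in software as an exclusive-OR over 128 byte blocks. The sketch below is such a model, offered only to illustrate the arithmetic; it does not describe the chip's internal logic, and the names BLOCK_BYTES, parity_accumulate and parity_error are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 128

/* Write mode: the first block is copied into the parity buffer,
   every later block is exclusive-ORed into it. */
static void parity_accumulate(uint8_t parity[BLOCK_BYTES],
                              const uint8_t block[BLOCK_BYTES],
                              int first_block)
{
    size_t i;
    if (first_block) {
        memcpy(parity, block, BLOCK_BYTES);
        return;
    }
    for (i = 0; i < BLOCK_BYTES; i++)
        parity[i] ^= block[i];
}

/* Check mode: once the stored parity has been XORed with every data
   block, any nonzero byte indicates a parallel parity error.
   Returns 1 on error, 0 otherwise. */
static int parity_error(const uint8_t accumulated[BLOCK_BYTES])
{
    size_t i;
    for (i = 0; i < BLOCK_BYTES; i++)
        if (accumulated[i] != 0)
            return 1;
    return 0;
}
```

The same accumulation covers the correct mode: XORing the parity block with all but one of the data blocks leaves the contents of the missing block in the buffer.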



Data Transfer Control Register (Read/Write) 
Register Address 1 

This register is cleared by the reset signal or by writing
to the reset bit.



Enable Data Transfer on Port A. Set to
enable the Port A Req/Ack handshake.
Enable Data Transfer on Port B. Set to
enable the Port B Req/Ack handshake.
Port A to Port B. If set, data
transfer is from Port A to Port B. If
reset, data transfer is from Port B to
Port A. In order to avoid any glitches
on the request lines, the state of this
bit should not be altered at the same
time as the enable data transfer bits 0
or 1 above.

uProcessor Parity Enable. Set if parity
is to be checked on the microprocessor
interface. It will only be checked when
writing to the Fifo Data Register or
reading from the Fifo Data or Checksum
Registers, or during a Port A
Request/Acknowledge transfer in 16 bit
mode. The chip will, however, always
re-generate parity ensuring that
correct parity is written to the RAM or

Port A Parity Enable. Set if parity is
to be checked on Port A. It is checked
when accessing the Fifo Data Register in
16 bit mode, or during a Port A
Request/Acknowledge transfer. The chip
will, however, always re-generate parity
ensuring that correct parity is written
to the RAM or read on the Port A

Port B Parity Enable. Set if Port B
data has valid byte parities. If it is
not set, byte parity is generated
internally to the chip when writing to
the RAMs. Byte parity is not checked
when writing from Port B, but always
checked when reading to Port B.
Checksum Enable. Set to enable writing
to the 16 bit checksum register. This

for all RAM accesses, including
accesses to the Fifo Data Register, as


register. This bit m
Port A Master. Set if Port A is to
operate in the master mode on Port A
during the data transfer.



Data Transfer Status Register (Read Only) 
Register Address 2 

This register is cleared by the reset signal or by writing
to the reset bit.



RAM X Address Register (Read/Write) 
Register Address 4 

This register is cleared by the reset signal or by writing
to the reset bit. The Enable Data Transfer bits in the
Data Transfer Control Register must be reset before
attempting to write to this register, else the write will be
ignored.



RAM X word address
RAM X full
Not Used
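
The RAM address register fields can be decoded as in the sketch below. The position of the full bit follows from the value of 20 hexadecimal that the text uses to force RAM full with the address at the first word; the five bit width of the word address is an assumption inferred from the 32 word RAM size, and all three function names are hypothetical.

```c
#define RAM_FULL_BIT       0x20u   /* the 20 hexadecimal value used in the text */
#define RAM_WORD_ADDR_MASK 0x1Fu   /* assumed: 5 bits address 32 words */

/* Word address (0..31) held in a RAM X or RAM Y address register. */
static unsigned ram_word_address(unsigned char addr_reg)
{
    return addr_reg & RAM_WORD_ADDR_MASK;
}

/* Nonzero when the RAM full bit is set. */
static int ram_is_full(unsigned char addr_reg)
{
    return (addr_reg & RAM_FULL_BIT) != 0;
}

/* Bytes already in the RAM, at 4 bytes per word. */
static unsigned ram_bytes_written(unsigned char addr_reg)
{
    return ram_word_address(addr_reg) * 4u;
}
```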



byte address registers. 
uProc Port Parity Error. Set if the
uProc Parity Enable bit is set and a
parity error is detected on the
microprocessor interface during any RAM
access or write to the Checksum Register
in 16 bit mode.

Port A Parity Error. Set if the Port A
Parity Enable bit is set and a parity
error is detected on the Port A

to the Checksum Register. 
Pon B Pari 



onngur. 



lot in parity correct n 
ro result is detected when the 
Parity Sync signal is true. It is also 
set whenever data is read out onto Port 
B and the data being read back through 
the bidirectional buffer does not 
compare. 

Port B Bytes 0-3 Parity Error. Set 
whenever the data being read out of the 
RAMs on the Port B side has bad parity. 



Ram Access Control Register (Read/Write) 
Register Address 3 

This register is cleared by the reset signal or by writ- 
ing to the reset bit. The Enable Data Transfer bits in the 
Data Transfer Control Register must be reset before 
attempting to write to this register, else the write will be 
ignored. 



Port A byte address 0. This bit is the
least significant byte address bit. It
is read directly bypassing any inversion
done by the invert bit in the Data
Transfer Configuration Register.
Port A byte address 1. This bit is the
most significant byte address bit. It
is read directly bypassing any inversion
done by the invert bit in the Data
Transfer Configuration Register.
Port A to RAM Y. Set if Port A is
accessing RAM Y, and reset if it is
accessing RAM X.
Port B to RAM Y. Set if Port B is
accessing RAM Y, and reset if it is
accessing RAM X.
Long Burst. If the chip is configured
to transfer data on Port A as a master,
and this bit is reset, the chip will
only negate Port A Ack/Rdy after every 8
bytes, or 4 words in 16 bit mode, have



" RAM Y Address Register (Read/Write) 

Register Address 5 

This register is cleared by the reset signal or by writing
to the reset bit. The Enable Data Transfer bits in the
Data Transfer Control Register must be reset before
attempting to write to this register, else the write will be
ignored.



Fifo Data Register (Read/Write) 

Register Address 6 

The Enable Data Transfer bits in the Data Transfer
Control Register must be reset before attempting to
write to this register, else the write will be ignored. The
Port A to Port B bit in the Data Transfer Control Register
must also be set before writing this register. If it is
not, the RAM controls will be incremented but no data
will be written to the RAM. For consistency, the Port
A to Port B bit should be reset prior to reading this register.

Bits 0-7 are Fifo Data. The microprocessor may
access the FIFO by reading or writing this register. The
RAM control registers are updated as if the access was
using Port A. If the chip is configured with a 16 bit
CPU Interface the most significant byte will use the
Port A 0-7 data lines, and each Port A access will
increment the Port A byte address by 2.

Port A Checksum Register (Read/Write) 
Register Address 7 
This register is cleared by the reset signal or by writing
to the reset bit.



Port A Ack/Rdy will be negated every 16
bytes, or 8 words in 16 bit mode.
Not Used.



Bits 0-7 are Checksum Data. The chip will accumulate
a 16 bit checksum for all Port A accesses. If the chip
is configured with a 16 bit CPU interface, the most
significant byte is read on the Port A 0-7 data lines. If
data is written directly to this register it is added to the
current contents rather than overwriting them. It is
important to note that the Checksum Enable bit in the
Data Transfer Control Register must be set to write this
register and reset to read it.
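
The additive behaviour of the checksum register, with and without the carry wrap option, can be illustrated with a short software model. This is a sketch under assumptions: it treats the checksum as a plain 16 bit sum whose carry out is either discarded or fed back into the least significant bit, and the function name checksum_add is hypothetical. It does not model how the chip splits Port A accesses into checksum operands.

```c
#include <stdint.h>

/* Adds one 16 bit value to a running 16 bit checksum.
   With carry_wrap set, a carry out of bit 15 is added back into bit 0;
   otherwise the carry is discarded. */
static uint16_t checksum_add(uint16_t checksum, uint16_t word, int carry_wrap)
{
    uint32_t sum = (uint32_t)checksum + word;
    if (carry_wrap && sum > 0xFFFFu)
        sum = (sum & 0xFFFFu) + 1u;   /* feed the carry out back into bit 0 */
    return (uint16_t)sum;
}
```

Under this model, adding FEFE hexadecimal to a checksum of 0102 hexadecimal with carry wrap inhibited gives zero, which is the result the test routine later in this description checks for.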

PROGRAMMING THE FIFO CHIP 
In general the fifo chip is programmed by writing to
the data transfer configuration and control registers to
enable a data transfer, and by reading the data transfer
status register at the end of the transfer to check the
completion status. Usually the data transfer itself will
take place with both the Port A and the Port B
handshakes enabled, and in this case the data transfer itself
should be done without any other microprocessor
interaction. In some applications, however, the Port A
handshake may not be enabled, and it will be necessary for
the microprocessor to fill or empty the fifo by
repeatedly writing or reading the Fifo Data Register.

Since the fifo chip has no knowledge of any byte
counts, there is no way of telling when any data transfer
is complete by reading any register within this chip
itself. Determination of whether the data transfer has
been completed must therefore be done by some other
circuitry outside this chip.

The following C language routines illustrate how the 
parity FIFO chip may be programmed. The routines 
assume that both Port A and the microprocessor port 
are connected to the system microprocessor, and return 
a size code of 16 bits, but that the hardware addresses 
the Fifo chip as long 32 bit registers. 



struct FIFO_regs {
    unsigned char config, a1, a2, a3;
    unsigned char control, b1, b2, b3;
    unsigned char status, c1, c2, c3;
    unsigned char ram_access_control, d1, d2, d3;
    unsigned char ram_X_addr, e1, e2, e3;
    unsigned char ram_Y_addr, f1, f2, f3;
    unsigned long data;
    unsigned int checksum, h1;
};

#define FIFO1 ((struct FIFO_regs *) FIFO_BASE_ADDRESS)

#define FIFO_RESET 0x80
#define FIFO_16_BITS 0x08
#define FIFO_CARRY_WRAP 0x40
#define FIFO_PORT_A_ENABLE 0x01
#define FIFO_PORT_B_ENABLE 0x02
#define FIFO_PORT_ENABLES 0x03
#define FIFO_PORT_A_TO_B 0x04
#define FIFO_CHECKSUM_ENABLE 0x40
#define FIFO_DATA_IN_RAM 0x01
#define FIFO_FORCE_RAM_FULL 0x20

#define PORT_A_TO_PORT_B(fifo) ((fifo->control) & 0x04)
#define PORT_A_BYTE_ADDRESS(fifo) ((fifo->ram_access_control) & 0x03)
#define PORT_A_TO_RAM_Y(fifo) ((fifo->ram_access_control) & 0x04)
#define PORT_B_TO_RAM_Y(fifo) ((fifo->ram_access_control) & 0x08)

/***********************************************************
The following routine initiates a Fifo data transfer using
two values passed to it.

config_data    This is the data to be written to the
               configuration register.

control_data   This is the data to be written to the Data
               Transfer Control Register. If the data transfer
               is to take place automatically using both the
               Port A and Port B handshakes, both data transfer
               enables bits should be set in this parameter.
***********************************************************/

FIFO_initiate_data_transfer(config_data, control_data)
unsigned char config_data, control_data;
{
    FIFO1->config = config_data | FIFO_RESET;   /* Set Configuration value & Reset */
    FIFO1->control = control_data & (~FIFO_PORT_ENABLES);   /* Set everything but enables */
    FIFO1->control = control_data;   /* Set data transfer enables */
}

/***********************************************************
The following routine forces the transfer of any odd bytes
that have been left in the Fifo at the end of a data transfer.
It first disables both ports, then forces the Ram Full bits, and
then re-enables the appropriate Port.
***********************************************************/

FIFO_force_odd_length_transfer()
{
    FIFO1->control &= ~FIFO_PORT_ENABLES;   /* Disable Ports A & B */
    if (PORT_A_TO_PORT_B(FIFO1)) {
        if (PORT_A_TO_RAM_Y(FIFO1)) {
            FIFO1->ram_Y_addr = FIFO_FORCE_RAM_FULL;   /* Set RAM Y full */
        }
        else FIFO1->ram_X_addr = FIFO_FORCE_RAM_FULL;   /* Set RAM X full */
        FIFO1->control |= FIFO_PORT_B_ENABLE;   /* Re-Enable Port B */
    }
    else {
        if (PORT_B_TO_RAM_Y(FIFO1)) {
            FIFO1->ram_Y_addr = FIFO_FORCE_RAM_FULL;   /* Set RAM Y full */
        }
        else FIFO1->ram_X_addr = FIFO_FORCE_RAM_FULL;   /* Set RAM X full */
        FIFO1->control |= FIFO_PORT_A_ENABLE;   /* Re-Enable Port A */
    }
}

/***********************************************************
The following routine returns how many odd bytes have been
left in the Fifo at the end of a data transfer.
***********************************************************/

int FIFO_count_odd_bytes()
{
    int number_odd_bytes;
    number_odd_bytes = 0;
    if (FIFO1->status & FIFO_DATA_IN_RAM) {
        if (PORT_A_TO_PORT_B(FIFO1)) {
            number_odd_bytes = (PORT_A_BYTE_ADDRESS(FIFO1));
            if (PORT_A_TO_RAM_Y(FIFO1))
                number_odd_bytes += (FIFO1->ram_Y_addr) * 4;
            else number_odd_bytes += (FIFO1->ram_X_addr) * 4;
        }
        else {
            if (PORT_B_TO_RAM_Y(FIFO1))
                number_odd_bytes = (FIFO1->ram_Y_addr) * 4;
            else number_odd_bytes = (FIFO1->ram_X_addr) * 4;
        }
    }
    return (number_odd_bytes);
}

/***********************************************************
The following routine tests the microprocessor interface of
the chip. It first writes and reads the first 6 registers. It
then writes 1s, 0s, and an address pattern to the RAM, reading the
data back and checking it.

The test returns a bit significant error code where each
bit represents the address of the registers that failed.

Bit 0 = config register failed
Bit 1 = control register failed
Bit 2 = status register failed
Bit 3 = ram access control register failed
Bit 4 = ram X address register failed
Bit 5 = ram Y address register failed
Bit 6 = data register failed
Bit 7 = checksum register failed
***********************************************************/

#define RAM_DEPTH 64   /* number of long words in Fifo Ram */

unsigned char reg_expected_data[6] = { 0x7F, 0xFF, 0x00, 0x1F, 0x3F, 0x3F };

char FIFO_uprocessor_interface_test()
{
    unsigned long test_data;
    unsigned char *register_addr;
    int i;
    char j, error;

    FIFO1->config = FIFO_RESET;   /* reset the chip */
    error = 0;
    register_addr = (unsigned char *) FIFO1;
    j = 1;   /* error bit for the register under test */

    /* first test registers 0 thru 5 */
    for (i = 0; i < 6; i++) {
        *register_addr = 0xFF;   /* write test data */
        if (*register_addr != reg_expected_data[i]) error |= j;
        *register_addr = 0;   /* write 0s to register */
        if (*register_addr) error |= j;
        *register_addr = 0xFF;   /* write test data again */
        if (*register_addr != reg_expected_data[i]) error |= j;
        FIFO1->config = FIFO_RESET;   /* reset the chip */
        if (*register_addr) error |= j;   /* register should be 0 */
        register_addr++;   /* go to next register */
        j <<= 1;   /* move to the next error bit */
    }

    /* now test Ram data & checksum registers
       test 1s throughout Ram & then test 0s */
    for (test_data = -1; test_data != 1; test_data++) {   /* test for 1s & 0s */
        FIFO1->config = FIFO_RESET | FIFO_16_BITS;
        FIFO1->control = FIFO_PORT_A_TO_B;
        for (i = 0; i < RAM_DEPTH; i++)   /* write data to RAM */
            FIFO1->data = test_data;
        FIFO1->control = 0;
        for (i = 0; i < RAM_DEPTH; i++)
            if (FIFO1->data != test_data) error |= j;   /* read & check data */
        if (FIFO1->checksum) error |= 0x80;   /* checksum should = 0 */
    }

    /* now test Ram data with address pattern
       uses a different pattern for every byte */
    test_data = 0x00010203;   /* address pattern start */
    FIFO1->config = FIFO_RESET | FIFO_16_BITS | FIFO_CARRY_WRAP;
    FIFO1->control = FIFO_PORT_A_TO_B | FIFO_CHECKSUM_ENABLE;
    for (i = 0; i < RAM_DEPTH; i++) {
        FIFO1->data = test_data;   /* write address pattern */
        test_data += 0x04040404;
    }
    test_data = 0x00010203;   /* address pattern start */
    FIFO1->control &= ~FIFO_CHECKSUM_ENABLE;
    for (i = 0; i < RAM_DEPTH; i++) {
        if (FIFO1->status != FIFO_DATA_IN_RAM)
            error |= 0x04;   /* should be data in ram */
        if (FIFO1->data != test_data) error |= j;   /* read & check address pattern */
        test_data += 0x04040404;
    }
    if (FIFO1->checksum != 0x0102) error |= 0x80;   /* test checksum of address pattern */
    FIFO1->config = FIFO_RESET | FIFO_16_BITS;   /* inhibit carry wrap */
    FIFO1->checksum = 0xFEFE;   /* writing adds to checksum */
    if (FIFO1->checksum) error |= 0x80;   /* checksum should be 0 */
    if (FIFO1->status) error |= 0x04;   /* status should be 0 */
    return (error);
}



Attorney Docket No. :AUSP7209 
WP 1 /WSW/AUSP/ 7209.001 
What is claimed is:

1. Network server apparatus for use with a data net- 
work and a mass storage device, comprising: 

an interface processor unit coupleable to said network
and to said mass storage device;

a host processor unit capable of running remote procedures
defined by a client node on said network;

means in said interface processor unit for satisfying
requests from said network to store data from said
network on said mass storage device;

means in said interface processor unit for satisfying
requests from said network to retrieve data from
said mass storage device to said network; and

means in said interface processor unit for transmitting
predefined categories of messages from said network
to said host processor unit for processing in
said host processor unit, said transmitted messages
including all requests by a network client to run
client-defined procedures on said network server
apparatus.

2. Apparatus according to claim 1, wherein said inter- 
face processor unit comprises: 
a network control unit coupleable to said network; 
a data control unit coupleable to said mass storage 
device; 

a buffer memory;

means in said network control unit for transmitting to
said data control unit requests from said network to
store specified storage data from said network on
said mass storage device;

means in said network control unit for transmitting
said specified storage data from said network to
said buffer memory and from said buffer memory
to said data control unit;
means in said network control unit for transmitting to
said data control unit requests from said network to
retrieve specified retrieval data from said mass
storage device to said network;
means in said network control unit for transmitting
said specified retrieval data from said data control
unit to said buffer memory and from said buffer
memory to said network; and
means in said network control unit for transmitting
said predefined categories of messages from said
network to said host processing unit for processing
by said host processing unit.
3. Apparatus according to claim 2, wherein said data 
control unit comprises: 
a storage processor unit coupleable to said mass storage
device;
a file processor unit;

means on said file processor unit for translating said
file system level storage requests from said network
into requests to store data at specified physical
storage locations in said mass storage device;

means on said file processor unit for instructing said
storage processor unit to write data from said
buffer memory into said specified physical storage
locations in said mass storage device;

means on said file processor unit for translating file
system level retrieval requests from said network
into requests to retrieve data from specified physical
retrieval locations in said mass storage device;

means on said file processor unit for instructing said
storage processor unit to retrieve data from said
specified physical retrieval locations in said mass
storage device to said buffer memory if said data
from said specified physical locations is not already
in said buffer memory; and



means in said storage processor unit for transmitting 
data between said buffer memory and said mass 
storage device. 
4. Network server apparatus for use with a data net- 
work and a mass storage device, comprising: 
a network control unit coupleable to said network; 
a data control unit coupleable to said mass storage 
device; 

a buffer memory;
means for transmitting from said network control unit
to said data control unit requests from said network
to store specified storage data from said network
on said mass storage device;
means for transmitting said specified storage data by
DMA from said network control unit to said buffer
memory and by DMA from said buffer memory to
said data control unit;
means for transmitting from said network control unit
to said data control unit requests from said network
to retrieve specified retrieval data from said mass
storage device to said network; and
means for transmitting said specified retrieval data by
DMA from said data control unit to said buffer
memory and by DMA from said buffer memory to
said network control unit.

5. Apparatus according to claim 1, for use further
with a buffer memory, and wherein said requests from
said network to store and retrieve data include file system
level storage and retrieval requests respectively,
and wherein said interface processor unit comprises:
a storage processor unit coupleable to said mass storage
device;
a file processor unit;

means on said file processor unit for translating said
file system level storage requests into requests to
store data at specified physical storage locations in
said mass storage device;
means on said file processor unit for instructing said
storage processor unit to write data from said
buffer memory into said specified physical storage
locations in said mass storage device;

means on said file processor unit for translating said
file system level retrieval requests into requests to
retrieve data from specified physical retrieval locations
in said mass storage device;
means on said file processor unit for instructing said
storage processor unit to retrieve data from said
specified physical retrieval locations in said mass
storage device to said buffer memory if said data
from said specified physical locations is not already
in said buffer memory; and
means in said storage processor unit for transmitting
data between said buffer memory and said mass
storage device.

6. A data control unit for use with a data network and
a mass storage device, and in response to file system
level storage and retrieval requests from said data network,
comprising:
a data bus different from said network;
a buffer memory bank coupled to said bus;
storage processor apparatus coupled to said bus and
coupleable to said mass storage device;
file processor apparatus coupled to said bus, said file
processor apparatus including a local memory bank;
first means on said file processor unit for translating
said file system level storage requests into requests
to store data at specified physical storage locations
in said mass storage device; and
second means on said file processor unit for translating
said file system level retrieval requests into
requests to retrieve data from specified physical
retrieval locations in said mass storage device, said
first and second means for translating collectively
including means for caching file control information
through said local memory bank in said file
processor unit,

said data control unit further comprising means for
caching the file data, to be stored or retrieved according
to said storage and retrieval requests,
through said buffer memory bank.

7. A network node for use with a data network and a
mass storage device, comprising:

a system buffer memory;

a host processor unit having direct memory access to
said system buffer memory;

a network control unit coupleable to said network
and having direct memory access to said system
buffer memory;

a data control unit coupleable to said mass storage
device and having direct memory access to said
system buffer memory;

first means for satisfying requests from said network
to store data from said network on said mass storage
device;

second means for satisfying requests from said network
to retrieve data from said mass storage device
to said network; and
third means for transmitting predefined categories of
messages from said network to said host processor
unit for processing in said host processor unit, said
first, second and third means collectively including
means for transmitting from said network control
unit to said system memory bank by direct memory
access file data from said network for storage
on said mass storage device,
means for transmitting from said system memory
bank to said data control unit by direct memory
access said file data from said network for storage
on said mass storage device,
means for transmitting from said data control unit
to said system memory bank by direct memory
access file data for retrieval from said mass storage
device to said network, and
means for transmitting from said system memory
bank to said network control unit said file data
for retrieval from said mass storage device to
said network;
at least said network control unit including a microprocessor
and local instruction storage means distinct
from said system buffer memory, all instructions
for said microprocessor residing in said local
instruction storage means.
8. A network file server for use with a data network 
and a mass storage device, comprising: 
a host processor unit running a Unix operating sys- 
tem; 

an interface processor unit coupleable to said network
and to said mass storage device, said interface
processor unit including means for decoding all
NFS requests from said network, means for performing
all procedures for satisfying said NFS
requests, means for encoding any NFS reply messages
for return transmission on said network, and
means for transmitting predefined non-NFS categories
of messages from said network to said host
processor unit for processing in said host processor

9. Network server apparatus for use with a data net- 
work, comprising: 

a network controller coupleable to said network to 
receive incoming information packets over said 
network, said incoming information packets in- 
cluding certain packets which contain part or all of 
a request to said server apparatus, said request 
being in either a first or a second class of requests to 
said server apparatus; 

a first additional processor; 

an interchange bus different from said network and 
coupled between said network controller and said 
first additional processor; 

means in said network controller for detecting and 
satisfying requests in said first class of requests 
contained in said certain incoming information 
packets, said network controller lacking means in 
said network controller for satisfying requests in 
said second class of requests; 

means in said network controller for detecting and 
assembling into assembled requests, requests in said 
second class of requests contained in said certain 
incoming information packets; 

means for delivering said assembled requests from 
said network controller to said first additional pro- 
cessor over said interchange bus; and 

means in said first additional processor for further 
processing said assembled requests in said second 
class of requests. 

10. Apparatus according to claim 9, wherein said 
packets each include a network node destination ad- 
dress, and wherein said means in said network control- 
ler for detecting and assembling into assembled re- 
quests, assembles said assembled requests in a format 
which omits said network node destination addresses. 

11. Apparatus according to claim 9, wherein said 
means in said network controller for detecting and satis- 
fying requests in said first class of requests, assembles 
said requests in said first class of requests into assembled 
requests before satisfying said requests in said first class 
of requests. 

12. Apparatus according to claim 9, wherein said
packets each include a network node destination ad- 
dress, wherein said means in said network controller for 
detecting and assembling into assembled requests, as- 
sembles said assembled requests in a format which omits 
said network node destination addresses, and wherein 
said means in said network controller for detecting and 
satisfying requests in said first class of requests, assem- 
bles said requests in said first class of requests, in a for- 
mat which omits said network node destination ad- 
dresses, before satisfying said requests in said first class 
of requests. 

13. Apparatus according to claim 9, wherein said 
means in said network controller for detecting and satis- 
fying requests in said first class includes means for pre- 
paring an outgoing message in response to one of said 
first class of requests, means for packaging said outgo- 
ing message in outgoing information packets suitable for 
transmission over said network, and means for transmit-
ting said outgoing information packets over said net-
work.

14. Apparatus according to claim 9, further compris-
ing a buffer memory coupled to said interchange bus,
and wherein said means for delivering said assembled
requests comprises:

means for transferring the contents of said assembled
requests over said interchange bus into said buffer
memory; and

means for notifying said first additional processor of
the presence of said contents in said buffer mem-
ory.

15. Apparatus according to claim 9, wherein said
means in said first additional processor for further pro-
cessing said assembled requests includes means for pre-
paring an outgoing message in response to one of said
second class of requests, said apparatus further compris-
ing means for delivering said outgoing message from
said first additional processor to said network controller
over said interchange bus, said network controller fur-
ther comprising means in said network controller for
packaging said outgoing message in outgoing informa-
tion packets suitable for transmission over said network,
and means in said network controller for transmitting
said outgoing information packages over said network.

16. Apparatus according to claim 9, wherein said first
class of requests comprises requests for an address of
said server apparatus, and wherein said means in said
network controller for detecting and satisfying requests
in said first class comprises means for preparing a re-
sponse packet to such an address request and means for
transmitting said response packet over said network.

17. Apparatus according to claim 9, for use further
with a second data network, said network controller
being coupleable further to said second network,
wherein said first class of requests comprises requests to
route a message to a destination reachable over said
second network, and wherein said means in said net-
work controller for detecting and satisfying requests in
said first class comprises means for detecting that one of
said certain packets comprises a request to route a mes-
sage contained in said one of said certain packets to a
destination reachable over said second network, and
means for transmitting said message over said second
network.

18. Apparatus according to claim 17, for use further
with a third data network, said network controller fur-
ther comprising means in said network controller for
detecting particular requests in said incoming informa-
tion packets to route a message contained in said partic-
ular requests, to a destination reachable over said third
network, said apparatus further comprising:

a second network controller coupled to said inter-
change bus and couplable to said third data net-
work;

means for delivering said message contained in said
particular requests to said second network control-
ler over said interchange bus; and

means in said second network controller for transmit-
ting said message contained in said particular re-
quests over said third network.

19. Apparatus according to claim 9, for use further
with a third data network, said network controller fur-
ther comprising means in said network controller for
detecting particular requests in said incoming informa-
tion packets to route a message contained in said partic-
ular requests, to a destination reachable over said third
network, said apparatus further comprising:

a second network controller coupled to said inter-
change bus and couplable to said third data net-
work;

means for delivering said message contained in said
particular requests to said second network control-
ler over said interchange bus; and

means in said second network controller for transmit-
ting said message contained in said particular re-
quests over said third network.

20. Apparatus according to claim 9, for use further
with a mass storage device, wherein said first additional
processor comprises a data control unit couplable to
said mass storage device, wherein said second class of
requests comprises remote calls to procedures for man-
aging a file system in said mass storage device, and
wherein said means in said first additional processor for
further processing said assembled requests in said sec-
ond class of requests comprises means for executing file
system procedures on said mass storage device in re-
sponse to said assembled requests.

21. Apparatus according to claim 20, wherein said file
system procedures include a read procedure for reading
data from said mass storage device,

said means in said first additional processor for fur-
ther processing said assembled requests including
means for reading data from a specified location in
said mass storage device in response to a remote
call to said read procedure,

said apparatus further including means for delivering
said data to said network controller,

said network controller further comprising means on
said network controller for packaging said data in
outgoing information packets suitable for transmis-
sion over said network, and means for transmitting
said outgoing information packets over said net-
work.

22. Apparatus according to claim 21, wherein said
means for delivering comprises:

a system buffer memory coupled to said interchange
bus;

means in said data control unit for transferring said
data over said interchange bus into said buffer
memory; and

means in said network controller for transferring said
data over said interchange bus from said system
buffer memory to said network controller.

23. Apparatus according to claim 20, wherein said file
system procedures include a read procedure for reading
a specified number of bytes of data from said mass stor-
age device beginning at an address specified in logical
terms including a file system ID and a file ID, said
means for executing file system procedures comprising:

means for converting the logical address specified in
a remote call to said read procedure to a physical
address; and

means for reading data from said physical address in
said mass storage device.

24. Apparatus according to claim 23, wherein said
mass storage device comprises a disk drive having
numbered tracks and sectors, wherein said logical ad-
dress specifies said file system ID, said file ID, and a
byte offset, and wherein said physical address specifies
a corresponding track and sector number.

25. Apparatus according to claim 20, wherein said file
system procedures include a read procedure for reading
a specified number of bytes of data from said mass stor-
age device beginning at an address specified in logical
terms including a file system ID and a file ID,

said data control unit comprising a file processor
coupled to said interchange bus and a storage pro-
cessor coupled to said interchange bus and coupla-
ble to said mass storage device,

said file processor comprising means for converting
the logical address specified in a remote call to said
read procedure to a physical address,

said apparatus further comprising means for deliver-
ing said physical address to said storage processor,

said storage processor comprising means for reading
data from said physical address in said mass storage
device and for transferring said data over said in-
terchange bus into said buffer memory; and

means in said network controller for transferring said
data over said interchange bus from said system
buffer memory to said network controller.

26. Apparatus according to claim 20, wherein said file
system procedures include a write procedure for writ-
ing data contained in an assembled request, to said mass
storage device,

said means in said first additional processor for fur-
ther processing said assembled requests including
means for writing said data to a specified location
in said mass storage device in response to a remote
call to said write procedure.

27. Apparatus according to claim 9, wherein said first
additional processor comprises a host computer cou-
pled to said interchange bus, wherein said second class
of requests comprises remote calls to procedures other
than procedures for managing a file system, and
wherein said means in said first additional processor for
further processing said assembled requests in said sec-
ond class of requests comprises means for executing
remote procedure calls in response to said assembled
requests.

28. Apparatus according to claim 27, for use further
with a mass storage device and a data control unit
couplable to said mass storage device and coupled to
said interchange bus, wherein said network controller
further comprises means in said network controller for
detecting and assembling remote calls, received over
said network, to procedures for managing a file system
in said mass storage device, and wherein said data con-
trol unit comprises means for executing file system pro-
cedures on said mass storage device in response to said
remote calls to procedures for managing a file system in
said mass storage device.

29. Apparatus according to claim 27, further compris-
ing means for delivering all of said incoming informa-
tion packets not recognized by said network controller
to said host computer over said interchange bus.

30. Apparatus according to claim 9, wherein said
network controller comprises:

a microprocessor;

a local instruction memory containing local instruc-
tion code;

a local bus coupled between said microprocessor and
said local instruction memory;

bus interface means for interfacing said microproces-
sor with said interchange bus at times determined
by said microprocessor in response to said local
instruction code; and

network interface means for interfacing said micro-
processor with said data network,

said local instruction memory including all instruc-
tion code necessary for said microprocessor to
perform said function of detecting and satisfying
requests in said first class of requests, and all in-
struction code necessary for said microprocessor to
perform said function of detecting and assembling
requests in said second class of requests.

31. Network server apparatus for use with a data
network, comprising:

a network controller coupleable to said network to
receive incoming information packets over said
network, said incoming information packets in-
cluding certain packets which contain part or all of
a message to said server apparatus, said message
being in either a first or a second class of messages
to said server apparatus, said messages in said first
class of messages including certain messages con-
taining requests;

a host computer;

an interchange bus different from said network and
coupled between said network controller and said
host computer;

means in said network controller for detecting and
satisfying said requests in said first class of mes-
sages;

means for delivering messages in said second class of
messages from said network controller to said host
computer over said interchange bus; and

means in said host computer for further processing
said messages in said second class of messages.

32. Apparatus according to claim 31, wherein said
packets each include a network node destination ad-
dress, and wherein said means for delivering messages
in said second class of messages comprises means in said
network controller for detecting said messages in said
second class of messages and assembling them into as-
sembled messages in a format which omits said network
node destination addresses.

33. Apparatus according to claim 31, wherein said
means in said network controller for detecting and satis-
fying requests in said first class includes means for pre-
paring an outgoing message in response to one of said
requests in said first class of messages, means for pack-
aging said outgoing message in outgoing information
packets suitable for transmission over said network, and
means for transmitting said outgoing information pack-
ets over said network.

34. Apparatus according to claim 31, for use further
with a second data network, said network controller
being coupleable further to said second network,
wherein said first class of messages comprises messages
to be routed to a destination reachable over said second
network, and wherein said means in said network con-
troller for detecting and satisfying requests in said first
class comprises means for detecting that one of said
certain packets includes a request to route a message
contained in said one of said certain packets to a destina-
tion reachable over said second network, and means for
transmitting said message over said second network.

35. Apparatus according to claim [illegible], for use further
with a third data network, said network controller fur-
ther comprising means in said network controller for
[illegible] network, said apparatus further comprising:

a second network controller coupled to said inter-
change bus and couplable to said third data net-
work;

means for delivering said particular messages to said
second network controller over said interchange
bus, substantially without involving said host com-
puter; and
means in said second network controller for transmit-
ting said message contained in said particular re-
quests over said third network, substantially with-
out involving said host computer.

36. Apparatus according to claim 31, for use further
with a mass storage device, further comprising a data
control unit coupleable to said mass storage device,

said network controller further comprising means in
said network controller for detecting ones of said
incoming information packets containing remote
calls to procedures for managing a file system in
said mass storage device, and means in said net-
work controller for assembling said remote calls
from said incoming packets into assembled calls,
substantially without involving said host computer,

said apparatus further comprising means for deliver-
ing said assembled file system calls to said data
control unit over said interchange bus substantially
without involving said host computer,

and said data control unit comprising means in said
data control unit for executing file system proce-
dures on said mass storage device in response to
said assembled file system calls, substantially with-
out involving said host computer.

37. Apparatus according to claim 31, further compris-
ing means for delivering all of said incoming informa-
tion packets not recognized by said network controller 
to said host computer over said interchange bus. 

38. Apparatus according to claim 31, wherein said
network controller comprises:

a microprocessor;

a local instruction memory containing local instruc-
tion code;

a local bus coupled between said microprocessor and
said local instruction memory;

bus interface means for interfacing said microproces-
sor with said interchange bus at times determined
by said microprocessor in response to said local
instruction code; and

network interface means for interfacing said micro-
processor with said data network,

said local instruction memory including all instruc-
tion code necessary for said microprocessor to
perform said function of detecting and satisfying
requests in said first class of requests.

39. File server apparatus for use with a mass storage 
device, comprising: 

a requesting unit capable of issuing calls to file system 

procedures in a device-independent form; 
a file controller including means for converting said 
file system procedure calls from said device- 
independent form to a device-specific form and 
means for issuing device-specific commands in
response to at least a subset of said procedure calls,
said file controller operating in parallel with said
requesting unit; and

a storage processor including means for executing
said device-specific commands on said mass stor-
age device, said storage processor operating in
parallel with said requesting unit and said file con- 
troller. 

40. Apparatus according to claim 39, further compris-
ing:

an interchange bus;

first delivery means for delivering said file system
procedure calls from said requesting unit to said file
controller over said interchange bus; and

second delivery means for delivering said device-
specific commands from said file controller to said
storage processor over said interchange bus.

41. Apparatus according to claim 39, further compris-
ing:

an interchange bus coupled to said requesting unit
and to said file controller;

first memory means in said requesting unit and ad-
dressable over said interchange bus;

second memory means in said file controller;

means in said requesting unit for preparing in said first
memory means one of said calls to file system pro-
cedures;

means for notifying said file controller of the avail-
ability of said one of said calls in said first memory
means; and

means in said file controller for controlling an access
to said first memory means for reading said one of
said calls over said interchange bus into said second
memory means in response to said notification.

42. Apparatus according to claim 41, wherein said
means for notifying said file controller comprises:

a command FIFO in said file controller addressable
over said interchange bus; and

means in said requesting unit for controlling an access
to said FIFO for writing a descriptor into said
FIFO over said interchange bus, said descriptor
describing an address in said first memory means of
said one of said calls and an indication that said
address points to a message being sent.

43. Apparatus according to claim 41, further compris-
ing:

means in said file controller for controlling an access
to said first memory means over said interchange
bus for modifying said one of said calls in said first
memory means to prepare a reply to said one of
said calls; and

means for notifying said requesting unit of the avail-
ability of said reply in said first memory.

44. Apparatus according to claim 41, further compris-
ing:

a command FIFO in said requesting processor ad-
dressable over said interchange bus; and

means in said file controller for controlling an access
to said FIFO for writing a descriptor into said
FIFO over said interchange bus, said descriptor
describing said address in said first memory and an
indication that said address points to a reply to said
one of said calls.

45. Apparatus according to claim 39, further compris-
ing:

an interchange bus coupled to said file controller and
to said storage processor;

second memory means in said file controller and
addressable over said interchange bus;

means in said file controller for preparing one of said
device-specific commands in said second memory
means;

means for notifying said storage processor of the
availability of said one of said commands in said
second memory means; and

means in said storage processor for controlling an
access to said second memory means for reading
said one of said commands over said interchange
bus in response to said notification.

46. Apparatus according to claim 45, wherein said
means for notifying said storage processor comprises:

a command FIFO in said storage processor address-
able over said interchange bus; and

means in said file controller for controlling an access
to said FIFO for writing a descriptor into said
FIFO over said interchange bus, said descriptor
describing an address in said second memory of 
said one of said calls and an indication that said 
address points to a message being sent. 
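
By way of illustration only, the descriptor-through-FIFO notification recited in claims 42, 44 and 46 might be sketched as follows. All names (`Kind`, `Descriptor`, `CommandFifo`, `send_command`, `receive_command`) are hypothetical, and a Python dictionary stands in for the sender's bus-addressable memory.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    MESSAGE = "message being sent"  # indication recited in claims 42 and 46
    REPLY = "reply"                 # indication recited in claim 44


@dataclass(frozen=True)
class Descriptor:
    address: int  # where the command sits in the sender's memory
    kind: Kind    # what that address points to


class CommandFifo:
    """First-in, first-out queue of descriptors owned by the receiving unit."""

    def __init__(self) -> None:
        self._entries = deque()

    def write(self, descriptor: Descriptor) -> None:
        self._entries.append(descriptor)

    def read(self) -> Descriptor:
        return self._entries.popleft()


def send_command(sender_memory: dict, address: int, command: str,
                 fifo: CommandFifo) -> None:
    # The sender prepares the command in its own memory, then notifies the
    # receiver by writing only a small descriptor into the receiver's FIFO.
    sender_memory[address] = command
    fifo.write(Descriptor(address, Kind.MESSAGE))


def receive_command(sender_memory: dict, fifo: CommandFifo) -> str:
    # The receiver controls the access that fetches the command itself.
    descriptor = fifo.read()
    if descriptor.kind is not Kind.MESSAGE:
        raise ValueError("descriptor does not point to a message being sent")
    return sender_memory[descriptor.address]
```

The point of the sketch is that only the descriptor crosses to the receiver at notification time; the receiver then reads the full command from the sender's memory at a time of its own choosing.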

47. Apparatus according to claim 39, wherein said 
means for converting said file system procedure calls
comprises:

a file control cache in said file controller, storing
device-independent to device-specific conversion
information; and

means for performing said conversions in accordance
with said conversion information in said file con- 
trol cache. 

48. Apparatus according to claim 39, wherein said 
mass storage device includes a disk drive having num-
bered sectors, wherein one of said file system procedure
calls is a read data procedure call,

said apparatus further comprising an interchange bus
and a system buffer memory addressable over said
interchange bus,

said means for converting said file system procedure
calls including means for issuing a read sectors
command in response to one of said read data pro-
cedure calls, said read sectors command specifying
a starting sector on said disk drive, a count indicat-
ing the amount of data to read, and a pointer to a
buffer in said system buffer memory, and

said means for executing device-specific commands
including means for reading data from said disk
drive beginning at said starting sector and continu-
ing for the number of sectors indicated by said
count, and controlling an access to said system 
buffer memory for writing said data over said inter- 
change bus to said buffer in said system buffer 
memory. 
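
By way of illustration only, the read sectors command of claim 48 might be sketched as follows. The names `ReadSectors` and `execute_read_sectors` are hypothetical, the 512-byte sector size is an assumption not stated in the claims, and byte arrays stand in for the disk drive and the system buffer memory.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: the claims do not state a sector size


@dataclass(frozen=True)
class ReadSectors:
    starting_sector: int  # starting sector on the disk drive
    count: int            # number of sectors to read
    buffer_pointer: int   # offset of the buffer in system buffer memory


def execute_read_sectors(cmd: ReadSectors, disk: bytes,
                         system_buffer: bytearray) -> None:
    """Copy cmd.count sectors from the disk into the system buffer memory."""
    start = cmd.starting_sector * SECTOR_SIZE
    length = cmd.count * SECTOR_SIZE
    if start + length > len(disk):
        raise ValueError("read extends past the end of the disk")
    if cmd.buffer_pointer + length > len(system_buffer):
        raise ValueError("buffer does not fit in system buffer memory")
    end = cmd.buffer_pointer + length
    system_buffer[cmd.buffer_pointer:end] = disk[start:start + length]
```

The command carries everything the storage processor needs, so in this sketch the executing function never consults the file system: it sees only sectors, a count and a buffer pointer.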

49. Apparatus according to claim 48, wherein said file
controller further includes means for determining
whether the data specified in said one of said read data
procedure calls is already present in said system buffer
memory, said means for converting issuing said read
sectors command only if said data is not already present
in said system buffer memory. 
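
By way of illustration only, the conditional issuing recited in claim 49 might be sketched as follows. The class `FileController`, its attributes, and the keying of cached data by starting sector and count are all hypothetical simplifications.

```python
class FileController:
    """Issues a read sectors command only when the data is not already buffered."""

    def __init__(self) -> None:
        self.cached = {}        # (starting_sector, count) -> buffer pointer
        self.issued = []        # read sectors commands sent to the storage processor
        self._next_pointer = 0  # next free position in system buffer memory

    def read_data(self, starting_sector: int, count: int) -> int:
        """Return a pointer to the requested data in system buffer memory."""
        key = (starting_sector, count)
        if key in self.cached:
            # Data already present: no command goes to the storage processor.
            return self.cached[key]
        pointer = self._next_pointer
        self._next_pointer += count
        self.issued.append((starting_sector, count, pointer))
        self.cached[key] = pointer
        return pointer
```

A repeated read of the same data returns the same pointer and leaves the list of issued commands unchanged, which is the behavior the claim describes.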

50. Apparatus according to claim 48, further compris-
ing:

means in said storage processor for controlling a
notification of said file controller when said read
sectors command has been executed;

means in said file controller, responsive to said notifi-
cation from said storage processor, for controlling
a notification of said requesting unit that said read
data procedure call has been executed; and

means in said requesting unit, responsive to said noti-
fication from said file controller, for controlling an
access to said system buffer memory for reading
said data over said interchange bus from said buffer
in said system buffer memory to said requesting
unit. 

51. Apparatus according to claim 39, wherein said 
mass storage device includes a disk drive having num- 
bered sectors, wherein one of said file system procedure 
calls is a write data procedure call,

said apparatus further comprising an interchange bus
and a system buffer memory addressable over said
interchange bus,

said means for converting said file system procedure
calls including means for issuing a write sectors
command in response to one of said write data
procedure calls, said write data procedure call
including a pointer to a buffer in said system buffer
memory containing data to be written, and said
write sectors command including a starting sector
on said disk drive, a count indicating the amount of 
data to write, and said pointer to said buffer in said 
buffer memory, and 

said means for executing device-specific commands 
including means for controlling an access to said 
buffer memory for reading said data over said in- 
terchange bus from said buffer in said system buffer 
memory, and writing said data to said disk drive 
beginning at said starting sector and continuing for 
the number of sectors indicated by said count. 

52. Apparatus according to claim 51, further compris-
ing:

means in said requesting unit for controlling an access
to said system buffer memory for writing said data 
over said interchange bus to said buffer in said 
system buffer memory; and 

means in said requesting unit for issuing said one of 
said write data procedure calls when said data has 
been written to said buffer in said system buffer 
memory. 

53. Apparatus according to claim 52, further compris- 
ing: 

means in said requesting unit for issuing a buffer allo- 
cation request; and 

means in said file controller for allocating said buffer 
in said system buffer memory in response to said 
buffer allocation request, and for providing said 
pointer, before said data is written to said buffer in 
said system buffer memory. 

54. Network controller apparatus for use with a first 
data network carrying signals representing information 
packets encoded according to a first physical layer 
protocol, comprising: 

a first network interface unit, a first packet bus and 
first packet memory addressable by said first net- 
work interface unit over said first packet bus, said 
first network interface unit including means for 
receiving signals over said first network represent- 
ing incoming information packets, extracting said 
incoming information packets and writing said 
incoming information packets into said first packet 
memory over said first packet bus; 

a first packet bus port; 

first packet DMA means for reading data over said 
first packet bus from said first packet memory to 
said first packet bus port; and 

a local processor including means for accessing said 
incoming information packets in said first packet 
memory and, in response to the contents of said 
incoming information packets, controlling said first 
packet DMA means to read selected data over said 
first packet bus from said first packet memory to 
said first packet bus port, said local processor in- 
cluding a CPU, a CPU bus and CPU memory con- 
taining CPU instructions, said local processor op- 
erating in response to said CPU instructions, said 
CPU instructions being received by said CPU over 
said CPU bus independently of any of said writing 
by said first network interface unit of incoming 
information packets into said first packet memory 
over said first packet bus and independently of any 
of said reading by said first packet DMA means of
data over said first packet bus from said first packet 
memory to said first packet bus port. 
55. Apparatus according to claim 54, wherein said
first network interface unit further includes means for
reading outgoing information packets from said first
packet memory over said first packet bus, encoding said
outgoing information packets according to said first
physical layer protocol, and transmitting signals over
said first network representing said outgoing informa-
tion packets,

said local processor further including means for pre-
paring said outgoing information packets in said
first packet memory, and for controlling said first
network interface unit to read, encode and transmit
said outgoing information packets,

said receipt of CPU instructions by said CPU over
said CPU bus being independent further of any of
said reading by said first network interface unit of
outgoing information packets from said first packet
memory over said first packet bus. 
56. Apparatus according to claim 54, further compris- 
ing a first FIFO having first and second ports, said first 
port of said first FIFO being said first packet bus port. 
57. Apparatus according to claim 56, for use further
with an interchange bus, further comprising inter-
change bus DMA means for reading data from said
second port of said first FIFO onto said interchange
bus,

said local processor further including means for con-
trolling said interchange bus DMA means to read
said data from said second port of said first FIFO
onto said interchange bus.

58. Apparatus according to claim 54, for use further
with a second data network carrying signals represent-
ing information packets encoded according to a second
physical layer protocol, further comprising:

a second network interface unit, a second packet bus
and second packet memory addressable by said
second network interface unit over said second
packet bus, said second network interface unit in-
cluding means for reading outgoing information
packets from said second packet memory over said
second packet bus, encoding said outgoing infor-
mation packets according to said second physical
layer protocol, and transmitting signals over said
second network representing said outgoing infor-
mation packets;

a second packet bus port; and

second packet DMA means for reading data over said
second packet bus from said second packet bus port
to said second packet memory,

said local processor further including means for
controlling said second packet DMA means to
read data over said second packet bus from said
second packet bus port to said second packet
memory, and for controlling said second net-
work interface unit to read, encode and transmit
outgoing information packets from said data in
said second packet memory,

said receipt of CPU instructions by said CPU over
said CPU bus being independent further of any
of said reading by said second packet DMA
means of data over said second packet bus from
said second packet bus port to said second packet
memory, and independent further of any of said
reading by said second network interface unit of
outgoing information packets from said second
packet memory over said second packet bus,

and all of said accesses to said first packet memory
over said first packet bus being independent of
said accesses to said second packet memory over
said second packet bus.

59. Apparatus according to claim 58, wherein said
second physical layer protocol is the same as said first
physical layer protocol.

60. Apparatus according to claim 58, further compris-
ing means, responsive to signals from said processor, for
coupling data from said first packet bus port to said
second packet bus port.

61. Apparatus according to claim 60, further compris-
ing:

first and second FIFOs, each having first and second
ports, said first port of said first FIFO being said
first packet bus port and said first port of said sec-
ond FIFO being said second packet bus port;

an interchange bus; and

interchange bus DMA means for transferring data
between said interchange bus and either said sec-
ond port of said first FIFO or said second port of
said second FIFO, selectably in response to DMA
control signals from said local processor.

62. Apparatus according to claim 61, wherein said
interchange bus DMA means comprises:

a transfer bus coupled to said second port of said first
FIFO and to said second port of said second FIFO;

coupling means coupled between said transfer bus
and said interchange bus; and

a controller coupled to receive said DMA control
signals from said processor and coupled to said first
and second FIFOs and to said coupling means to
control data transfers over said transfer bus.

63. Storage processing apparatus for use with a plu- 
rality of storage devices on a respective plurality of 
channel buses, and an interchange bus, said interchange 
bus capable of transferring data at a higher rate than any 
of said channel buses, comprising: 
dau transfer means coupled to each of said channel 
buses and to said interchange bus, for transferring 
data in parallel between said data transfer means 
and each of said channel buses at the daU transfer 
rales of each of said channel buses, respectively, 
and for transferring data between said daU transfer 
means and said interchange bus at a data transfer 
rate higher than said data transfer rates of any of 
said channel buses; and 
a local processor including transfer control means for 
controlling said data transfer means to transfer data 
between said data transfer means and specified ones
of said channel buses and for controlling said data 
transfer means to transfer data between said data
transfer means and said interchange bus, 
said local processor including a CPU, a CPU bus 
and CPU memory containing CPU instructions, 
said local processor operating in response to said 
CPU instructions, said CPU instructions being 
received by said CPU over said CPU bus inde- 
pendently of any of said data transfers between 
said channel buses and said data transfer means 
and independently of any of said data transfers
between said data transfer means and said inter- 
change bus. 

64. Apparatus according to claim 63, wherein the 
highest data transfer rate of said interchange bus is 
substantially equal to the sum of the highest data transfer
rates of all of said channel buses.
65. Apparatus according to claim 63, wherein said
data transfer means comprises:
a FIFO corresponding to each of said channel buses, 

each of said FIFOs having a first port and a second 

port;
a channel adapter coupled between the first port of 

each of said FIFOs and a respective one of said 

channels; and 

DMA means coupled to the second port of each of
said FIFOs and to said interchange bus, for trans- 
ferring data between said interchange bus and one 
of said FIFOs as specified by said local processor, 
said transfer control means in said local processor 
comprising means for controlling each of said
channel adapters separately to transfer data be- 
tween the channel bus coupled to said channel 
adapter and the FIFO coupled to said channel 
adapter, and for controlling said DMA controller
to transfer data between separately specified
ones of said FIFOs and said interchange bus, said 
DMA means performing said transfers sequen- 
tially. 
66. Apparatus according to claim 65, wherein said
DMA means comprises a command memory and a 
DMA processor, said local processor having means for 
writing FIFO/interchange bus DMA commands into 
said command memory, each of said commands being 
specific to a given one of said FIFOs and including an
indication of the direction of data transfer between said 
interchange bus and said given FIFO, each of said 
FIFOs generating a ready status indication, said DMA 
processor controlling the data transfer specified in each 
of said commands sequentially after the corresponding 
FIFO indicates a ready status, and notifying said local 
processor upon completion of the data transfer specified 
in each of said commands. 

67. Apparatus according to claim 65 further compris- 
ing an additional FIFO coupled between said CPU bus 
and said DMA memory, said local processor further 
having means for transferring data between said CPU 
and said additional FIFO, and said DMA means being 
further for transferring data between said interchange 
bus and said additional FIFO in response to commands 
issued by said local processor. 
ABSTRACT



This is achieved in a computer system employing a multiple 
facility operating system architecture. The computer system 
includes a plurality of processor units for implementing a 
predetermined set of peer-level facilities wherein each peer- 
level facility includes a plurality of related functions and a 
communications bus for interconnecting the processor units. 
Each of the processor units includes a central processor and
the stored program that, upon execution, provides for the
implementation of a predetermined peer-level facility of the
predetermined set of peer-level facilities, and for performing
a multi-tasking interface function. The multi-tasking interface
function is responsive to control messages for selecting
for execution functions of the predetermined peer-level
facility and that is responsive to the predetermined peer-level
facility for providing control messages to request or to
respond to the performance of functions of another peer- 
level facility of the computer system. The multi-tasking 
interface functions of each of the plurality of processor units 
communicate among one another via the network bus. 

17 Claims, 9 Drawing Sheets
MULTIPLE SOFTWARE-FACILITY
COMPONENT OPERATING SYSTEM FOR 
CO-OPERATIVE PROCESSOR CONTROL 
WITHIN A MULTIPROCESSOR COMPUTER 
SYSTEM 

This is a division of U.S. patent application Ser. No. 
08/225,356, filed Apr. 8, 1994, now U.S. Pat. No. 5,485,579,
which is a continuation of U.S. patent application Ser. No.
07/875,585, filed Apr. 28, 1992, abandoned, which is a
continuation of Ser. No. 07/404,885, filed Sep. 8, 1989, 
abandoned. 

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is related to the following U.S. 
Patent Applications: 

1. PARALLEL I/O NETWORK FILE SERVER 
ARCHITECTURE, inventors: John Row, Larry 
Boucher, William Pitts, and Steve Blightman; 

2. ENHANCED VMEBUS PROTOCOL UTILIZING
SYNCHRONOUS HANDSHAKING AND BLOCK
MODE DATA TRANSFER, inventor: Daryl D. Starr;

3. BUS LOCKING FIFO MULTI-PROCESSOR COMMUNICATIONS
SYSTEM UTILIZING PSEUDO-SYNCHRONOUS
HANDSHAKING AND BLOCK
MODE DATA TRANSFER, invented by William Pitts,
Stephen Blightman and Daryl D. Starr;

4. IMPROVED FAST TRANSFER DIRECT MEMORY
ACCESS CONTROLLER, invented by Daryl Starr,
Stephen Blightman and Larry Boucher.

The above applications are all assigned to the assignee of
the present invention and are all expressly incorporated
herein by reference.

1. Field of the Invention 

The present invention is generally related to operating 
system software architectures and, in particular, to a multi- 
processor operating system architecture based on multiple 
independent multi-tasking process kernels.

2. Background of the Invention 

The desire to improve productivity, in circumstances 
involving computers, is often realized by an improvement in 
computing throughput. Conventional file servers are recognized
as being a limiting factor in the potential productivity
associated with their client workstations. 

A file server is typically a conventional computer system 
coupled through a communications network, such as 
Ethernet, to client workstations and potentially other workstation
file servers. The file server operates to provide a
common resource base to its clients. The primary resource is 
typically the central storage and management of data files, 
but additional services including single point execution of 
certain types of programs, electronic mail delivery and 
gateway connection to other file servers and services are
generally also provided. 

The client workstations may utilize any of a number of 
communication network protocols to interact with the file 
server. Perhaps the most commonly known, if not most 
widely used, protocol suite is TCP/IP. This protocol suite,
and its supporting utility programs, provide for the creation
of logical communication channels between multiple client
workstations and a file server. These communication channels
are generally optimized for point-to-point file transfers,
i.e., multi-user file access control or activity administration
is not provided. In addition, the supporting utility programs
for these protocols impose a significant degree of user 
interaction in order to initiate file transfers as well as the
entire responsibility to manage the files once transferred. 

Recently, a number of network connected remote file
system mechanisms have been developed to provide clients
with a single consistent view of a file system of data files, 
even though portions of the file system may be physically 
distributed between a client's own local storage, one or more 
file servers or even other client workstations. These network 
file system mechanisms operate to hide the distinction 
between local data files and data files in the remotely 
distributed portions of the file system accessible only 
through the network. The advantages of such file system 
mechanisms include retention of multi-user access controls 
over the data files physically present on the server, to the 
extent intrinsically provided by a server, and a substantial 
simplification of a client workstation's view and productive 
utilization of the file system. 

Two implementations of a network file system mechanism 
are known as the network file system (NFS), available from 
Sun Microsystems, Inc., and the remote file sharing (RFS)
system available from American Telephone and Telegraph, 
Inc. 

The immediate consequence of network file system
mechanisms is that they have served to substantially increase
the throughput requirements of the file server itself, as well 
as that of the communications network. Thus, the number of 
client workstations that can be served by a single file server 
must be balanced against the reduction in productivity 
resulting from increased file access response time and the 
potentially broader effects of a degradation in communication
efficiency due to the network operating at or above its

An increase in the number of client workstations is 
conventionally handled by the addition of another file server, 
duplicating or possibly partitioning the file system between
the file servers, and providing a dedicated high bandwidth 
network connection between the file servers. Thus, another 
consequence of the limited throughput of conventional file 
servers is a greater cost and configuration complexity of the 
file server base in relation to the number of client worksta- 
tions that can be effectively serviced. 

Another complicating factor, for many technical and 
practical reasons, is a requirement that the file server be 
capable of executing the same or a similar operating system 
as the attached client workstations. The reasons include the 
need to execute maintenance and monitoring programs on 
the file server, and to execute programs, such as database 
servers, that would excessively load the communications 
network if executed remotely from the required file data. 
Another often overlooked consideration is the need to avoid 
the cost of supporting an operating system that is unique to 
the file server. 

Given these considerations, the file server is typically 
only a conventional general purpose computer with an
extended data storage capacity and communications net- 
work interface that is little different from that present on 
each of the client workstations. Indeed, many file servers are 
no more than a physically repackaged workstation. 
Unfortunately, even with multiple communications network 
interfaces, such workstation-based computers are either
incapable or inappropriate, from a cost/performance
viewpoint, to perform as a single file server to a large group
of client workstations. 

The throughput offered by conventional general purpose 
computers, considered in terms of their sustained file system 
facility data transfer bandwidth potential, is limited by a
number of factors, though primarily due to the general
purpose nature of their design. Computer system design is
necessarily dependent on the level and nature of the oper- 
ating system to be executed, the nature of the application 
load to be executed, and the degree of homogeneity of 
applications. For example, a computer system utilized solely 
for scientific computations may forego an operating system 
entirely, may be restricted to a single user at a time, and 
employ specialized computation hardware optimized for the 
anticipated highly homogeneous applications. Conversely, 
where an operating system is required, the system design 
typically calls for the utilization of dedicated peripheral 
controllers, operated under the control of a single processor 
executing the operating system, in an effort to reduce the 
peripheral control processing overhead of the system's 
single primary processor. Such is the design of most con- 
ventional file servers. 

A recurring theme in the design of general purpose 
computer systems is to increase the number of active pri- 
mary processors. In the simplest analysis, a linear improve- 
ment in the throughput performance of the computer system 
might be expected. However, utilization of increasing numbers
of primary processors is typically thwarted by the
greater growth of control overhead and contention for com- 
mon peripheral resources. Indeed, the net improvement in 
throughput is often seen to increase slightly before declining 
rapidly as the number of processors is increased. 

SUMMARY OF THE INVENTION

provide an operating system architecture for the control of a 
multi-processor system to provide an efficient, expandable 
computer system for servicing network file system requests. 

This is achieved in a computer system employing a
multiple facility operating system architecture. The computer
system includes a plurality of processor units for
implementing a predetermined set of peer-level facilities,
wherein each peer-level facility implements a plurality of
related functions, and a communications bus for interconnecting
the processor units. Each of the processor units
includes a central processor and a stored program that, upon
execution, provides for the implementation of a predetermined
peer-level facility and for implementing a multi-tasking
interface function. The multi-tasking interface function
is responsive to control messages for selecting for
execution functions of the predetermined peer-level facility.
The multi-tasking interface function is also responsive to the
predetermined peer-level facility for providing control messages
to request or to respond to the performance of functions
of another peer-level facility of the computer system.
The multi-tasking interface functions of each of the plurality
of processor units communicate among one another via the
network bus.

Thus, in a preferred embodiment of the present invention, 
the set of peer-level facilities includes network 
communications, file system control, storage control and a 
local host operating system. 
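
The message-driven arrangement summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the class names (`ControlMessage`, `Bus`, `Facility`), the queue-based bus, and the `read_file` and `read_block` functions are all hypothetical, chosen only to show one facility selecting a function in response to a control message and, in turn, sending a control message to a peer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ControlMessage:
    sender: str      # name of the requesting facility
    target: str      # name of the facility asked to act
    function: str    # which of the target's functions to execute
    args: tuple = ()


class Bus:
    """Hypothetical stand-in for the communications bus: a shared queue."""

    def __init__(self):
        self.facilities = {}
        self.pending = deque()

    def attach(self, facility):
        self.facilities[facility.name] = facility
        facility.bus = self

    def post(self, message):
        self.pending.append(message)

    def run(self):
        # Deliver messages in arrival order until none remain.
        while self.pending:
            message = self.pending.popleft()
            self.facilities[message.target].dispatch(message)


class Facility:
    """A peer-level facility with a message interface (assumed shape)."""

    def __init__(self, name):
        self.name = name
        self.bus = None
        self.functions = {}
        self.handled = []

    def request(self, target, function, *args):
        # Ask another facility to perform one of its functions.
        self.bus.post(ControlMessage(self.name, target, function, args))

    def dispatch(self, message):
        # Select the function named by the control message and execute it.
        result = self.functions[message.function](self, *message.args)
        self.handled.append((message.sender, message.function, result))


def demo():
    bus = Bus()
    network = Facility("network")
    file_system = Facility("file_system")
    storage = Facility("storage")
    for facility in (network, file_system, storage):
        bus.attach(facility)

    def read_block(facility, block_no):
        return f"data from block {block_no}"

    def read_file(facility, path):
        # The file system facility forwards part of the work to storage.
        facility.request("storage", "read_block", 7)
        return f"lookup of {path} forwarded"

    storage.functions["read_block"] = read_block
    file_system.functions["read_file"] = read_file

    network.request("file_system", "read_file", "/export/a")
    bus.run()
    return file_system.handled, storage.handled
```

Calling `demo()` sends one request from the network facility to the file system facility, which then asks the storage facility for a block; no central scheduler takes part in either hop.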

An advantage of the present invention is that it provides 
for the implementation of multiple facilities, each instance 
on a respective processor, all within a single cohesive system 
while incurring little additional control overhead in order to 
maintain operational coherency. 

Another advantage of the present invention is that direct 
peer to peer-level facility communication is supported in 
order to minimize overhead in processing network file
system requests. 

A further advantage of the present invention is that it 
realizes a computer system software architecture that is 
readily expandable to include multiple instances of each
peer-level facility, and respective peer-level processors, in a
single cohesive operating system environment including
direct peer to peer-level facility communications between
like facilities.

Yet another advantage of the present invention is that it 
may include an operating system as a facility operating 
concurrently and without conflict with the otherwise independent
peer to peer-level facility communications of the
other peer-level facilities. The operating system peer-level
facility may itself be a conventional operating system suitably
compatible with the workstation operating systems so
as to maintain compatibility with "standard" file server
operating systems. The operating system peer-level facility
may be used to handle exception conditions from the other
peer-level facilities including handling of non-network file
system requests. Consequently, the multiple facility operating
system architecture of the present invention appears to
client workstations as a conventional, single processor file
server.

A still further advantage of the present invention is that it 
provides a message-based operating system architecture
framework for the support of multiple, specialized peer-level
facilities within a single cohesive computer system; a
capability particularly adaptable for implementation of a
high-performance, high-throughput file server. 

BRIEF DESCRIPTION OF THE DRAWINGS 
These and other attendant advantages and features of the 
present invention will become apparent and readily appreciated
as the same becomes better understood by reference
to the following detailed description when considered in
conjunction with the accompanying drawings, in which like
reference numerals indicate like parts throughout the figures
thereof, and wherein:

FIG. 1 is a simplified block diagram of a preferred 
computer system architecture for implementing the multiple
facility operating system architecture of the present invention;

FIG. 2 is a block diagram of a network communications
processor suitable for implementing a network communica- 
tions peer-level facility in accordance with a preferred 
embodiment of the present invention; 

FIG. 3 is a block diagram of a file system processor 
suitable for implementing a file system controller peer-level
facility in accordance with a preferred embodiment of the 
present invention; 

FIG. 4 is a block diagram of a storage processor suitable 
for implementing a storage peer-level facility in accordance
with a preferred embodiment of the present invention; 

FIG. 5 is a simplified block diagram of a primary memory
array suitable for use as a shared memory store in a preferred 
embodiment of the present invention; 

FIG. 6 is a block diagram of the multiple facility operating 
system architecture configured in accordance with a pre- 
ferred embodiment of the present invention; 

FIG. 7 is a representation of a message descriptor passed 
between peer-level facilities to identify the location of a
message;

FIG. 8 is a representation of a peer-level facility message 
as used in a preferred embodiment of the present invention; 

FIG. 9 is a simplified representation of a conventional 
program function call; 

FIG. 10 is a simplified representation of an inter-facility
function call in accordance with the preferred embodiment 
of the present invention; 



FIG. 11 is a control state diagram illustrating the interface 
functions of two peer-level facilities in accordance with a 
preferred embodiment of the present invention; 

FIG. 12 is an illustration of a data flow for an LFS read
request through the peer-level facilities of a preferred
embodiment of the present invention; 

FIG. 13 is an illustration of a data flow for an LFS write 
request through the peer-level facilities of a preferred 
embodiment of the present invention; 

FIG. 14 illustrates the data flow of a non-LFS data packet 
between the network communication and local host peer- 
level facilities in accordance with a preferred embodiment of 
the present invention; and 

FIG. 15 illustrates the data flow of a data packet routed
between two network communications peer-level facilities 
in accordance with a preferred embodiment of the present 
invention. 



While the present invention is broadly applicable to a
wide variety of hardware architectures, and its software
architecture may be represented and implemented in a
variety of specific manners, the present invention may be
best understood from an understanding of its preferred 
embodiment. 

I. System Architecture Overview 
A. Hardware Architecture Overview 

A block diagram representing the preferred embodiment
of the hardware support for the present invention, generally 
indicated by the reference numeral 10, is provided in FIG. 1. 
The architecture of the preferred hardware system 10 is 
described in the above-identified related application entitled 
PARALLEL I/O NETWORK FILE SERVER ARCHITECTURE;
which application is expressly incorporated herein
by reference. 

The hardware components of the system 10 include 
multiple instances of network controllers 12, file system 
controllers 14, and mass storage processors 16, interconnected
by a high-bandwidth backplane bus 22. Each of these
controllers 12, 14, 16 preferably includes a high performance
processor and local program store, thereby minimizing their
need to access the bus 22. Rather, bus 22 accesses by the
controllers 12, 14, 16 are substantially limited to transfer
accesses as required to transfer control information and
client workstation data between the controllers 12, 14, 16,
system memory 18, and a local host processor 20, when
necessary. 

The illustrated preferred system 10 configuration includes
four network controllers 12₁₋₄, two file controllers 14₁₋₂, two
mass storage processors 16₁₋₂, a bank of four system
memory cards 18₁₋₄, and a host processor 20 coupled to the
backplane bus 22. The invention, however, is not limited to
this number and type of processors. Rather, six or more
network communications processors 12 and two or more
host processors 20 could be implemented within the scope of
the present invention.

Each network communications processor (NP) 12₁₋₄ preferably
includes a Motorola 68020 processor for supporting
two independent Ethernet network connections, shown as
the network pairs 26₁-26₄. Each of the network connections
directly supports the ten megabit per second data rate specified
for a conventional individual Ethernet network connection.
The preferred hardware embodiment of the present
invention thus realizes a combined maximum data throughput
potential of 80 megabits per second.
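
The 80 megabit per second figure follows directly from the configuration described: four network processors, each with two Ethernet connections at ten megabits per second. A short check of that arithmetic is sketched below; the function and parameter names are illustrative only and do not come from the specification.

```python
def combined_throughput_mbit(network_processors=4,
                             ports_per_processor=2,
                             mbit_per_port=10):
    """Peak aggregate network rate, assuming every port runs at full rate."""
    return network_processors * ports_per_processor * mbit_per_port


# Preferred configuration: 4 processors x 2 ports x 10 Mbit/s each.
PREFERRED_MBIT = combined_throughput_mbit()
```

The same function also shows how the aggregate scales if, as the text allows, six or more network processors are installed: `combined_throughput_mbit(network_processors=6)` gives 120.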



The file system processors (FP) 14₁₋₂, intended to operate
primarily as specialized compute engines, each include a
high-performance Motorola 68020 based microprocessor, 
four megabytes of local data store and a smaller quarter- 
megabyte high-speed program memory store. 

The storage processors (SP) 16₁₋₂ function as intelligent
small computer system interface (SCSI) controllers. Each
includes a Motorola 68020 microprocessor, a local program
and data memory, and an array of ten parallel SCSI channels.
Drive arrays 24₁₋₂ are coupled to the storage processors
16₁₋₂ to provide mass storage. Preferably, the drive arrays
24₁₋₂ are ten unit-wide arrays of SCSI storage devices
uniformly from one to three units deep. The preferred
embodiment of the present invention uses conventional 768
megabyte 5¼-inch hard disk drives for each unit of the
arrays 24₁₋₂. Thus, each drive array level achieves a storage
capacity of approximately 6 gigabytes, with each storage
processor readily supporting a total of 18 gigabytes.
Consequently, a system 10 is capable of realizing a total
combined data storage capacity of 36 gigabytes.
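
The capacity totals can be checked the same way. The sketch below takes the stated figure of approximately 6 gigabytes per drive array level as given, rather than deriving it from the drive size, and multiplies by the maximum depth of three levels and the two storage processors. The function and parameter names are hypothetical.

```python
def storage_capacity_gb(storage_processors=2,
                        max_levels_per_array=3,
                        gb_per_level=6):
    """Return (capacity per storage processor, total system capacity) in GB.

    gb_per_level is the approximate per-level figure stated in the text.
    """
    per_processor = max_levels_per_array * gb_per_level
    return per_processor, storage_processors * per_processor


PER_PROCESSOR_GB, SYSTEM_GB = storage_capacity_gb()
```

With the defaults, each storage processor supports 18 gigabytes and the two together provide 36 gigabytes, matching the figures above.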

The local host processor 20, in the preferred embodiments 
of the present invention, is a Sun central processor card, 
model Sun 3E120, manufactured and distributed by Sun 
Microsystems, Inc. 

Finally, the system memory cards 18 each provide 48 
megabytes of 32-bit memory for shared use within the 
computer system 10. The memory is logically visible to each 
of the processors of the system 10. 

A VME bus 22 is used in the preferred embodiments of 
the present invention to interconnect the network communication
processors 12, file system processors 14, storage
processors 16, primary memory 18, and host processor 20.
The hardware control logic for controlling the VME bus 22,
at least as implemented on the network communication
processor 12 and storage processor 16, implements a bus
master fast transfer protocol in addition to the conventional
VME transfer protocols. The system memory 18 correspondingly
implements a modified slave VME bus control
logic to allow the system memory 18 to also act as the fast
data transfer data source or destination for the network
communication processors 12 and storage processors 16.
The fast transfer protocol is described in the above-identified
related application entitled "ENHANCED VMEBUS PROTOCOL
UTILIZING SYNCHRONOUS HANDSHAKING
AND BLOCK MODE DATA TRANSFER"; which application
is expressly incorporated herein by reference.

It should be understood that, while the system 10 configuration
represents the initially preferred maximum hardware
configuration, the present invention is not limited to the
preferred number or type of controllers, the preferred size 
and type of disk drives or use of the preferred fast data 
transfer VME protocol. 
B. Software Architecture Overview 

Although applicable to a wide variety of primary, or full 
function, operating systems such as MVS and VMS, the 
preferred embodiment of the present invention is premised 
on the Unix operating system as distributed under license by 
American Telephone and Telegraph, Inc. and specifically the
SunOS version of the Unix operating system, as available 
from Sun Microsystems, Inc. The architecture of the Unix 
operating system has been the subject of substantial aca- 
demic study and many published works including "The 
Design of the Unix Operating System", Maurice J. Bach, 
Prentice Hall, Inc., 1986.

In brief, the Unix operating system is organized around a 
non-preemptive, multi-tasking, multi-user kernel that imple- 
ments a simple file-oriented conceptual model of a file 
system. Central to the model is a virtual file system (VFS)
interface that operates to provide a uniform file oriented,
multiple file system environment for both local and remote
files.

Connected to the virtual file system is the Unix file system
(UFS). The UFS allows physical devices, pseudo-devices
and other logical devices to appear and be treated, from a
client's perspective, as simple files within the file system
model. The UFS interfaces to the VFS to receive and
respond to file oriented requests such as to obtain the
attributes of a file, the stored parameters of a physical or
logical device, and, of course, to read and write data. In
carrying out these functions, the UFS interacts with a low
level software device driver that is directly responsible for
an attached physical mass storage device. The UFS handles
all operations necessary to resolve logical file oriented
operations, as passed from the VFS, down to the level of a
logical disk sector read or write request.

The VFS, in order to integrate access to remote files into
the file system model, provides a connection point for
network communications through the network file system
mechanism, if available. The preferred network file system
mechanism, NFS, is itself premised on the existence of a
series of communication protocol layers that, inclusive of
NFS and within the context of the present invention, can be
referred to as an NFS stack. These layers, in addition to an
NFS "layer," typically include a series of protocol handling
layers generally consistent with the International Standards
Organization's Open Systems Interconnection (ISO/OSI)
model. The OSI model has been the subject of many
publications, both regarding the conceptual aspects of the
model as well as specific implementations, including "Computer
Networks, 2nd Edition", Andrew S. Tanenbaum, Prentice
Hall, 1988.

In summary, the OSI layers utilized by the present invention
include all seven layers described in the OSI reference
model: application, presentation, session, transport,
network, data link and physical layers. These layers are
summarized below, in terms of their general purpose, function
and implementation for purposes of the present invention.

The application layer protocol, NFS, provides a set of
remote procedure call definitions, for use in both server and
client oriented contexts, to provide network file services. As
such, the NFS layer provides a link between the VFS of the
Unix kernel and the presentation protocol layer.

The presentation layer protocol, provided as an external
data representation (XDR) layer, defines a common description
and encoding of data as necessary to allow transfer of
data between different computer architectures. The XDR is
thus responsible for syntax and semantic translation between
the data representations of heterogeneous computer systems.

The session layer protocol, implemented as a remote
procedure call (RPC) layer, provides a remote procedure call
capability between a client process and a server process. In
a conventional file server, the NFS layer connects through
the XDR layer to the RPC layer in a server context to support
the file oriented data transfers and related requests of a
network client.

The transport layer protocol, typically implemented as
either a user datagram protocol (UDP) or transmission
control protocol (TCP) layer, provides for a simple connectionless
datagram delivery service. NFS uses UDP.

The network layer protocol, implemented as an internet
protocol (IP) layer, performs internet routing, based on
address mappings stored in an IP routing database, and data
packet fragmentation and reassembly.

The data link (DL) layer manages the transfer and receipt
of data packets based on packet frame information. Often
this layer is referred to as a device driver, since it contains
the low level software control interface to the specific
communications hardware, including program control of
low level data transmission error correction/handling and
data flow control. As such, it presents a hardware independent
interface to the IP layer.

Finally, the physical layer, an Ethernet controller, provides
a hardware interface to the network physical transmission
medium.

The conventional NFS stack, as implemented for the
uniprocessor VAX architecture, is available in source code
form under license from Sun Microsystems, Inc.

The preferred embodiment of the present invention utilizes
the conventional SunOS Unix kernel, the Sun/VAX
reference release of the UFS, and the Sun/VAX reference
release of the NFS stack as its operating system platform.
The present invention establishes an instantiation of the NFS
stack as an independent, i.e., separately executed, software
entity separate from the Unix kernel. Instantiations of the
UFS and the mass storage device driver are also established
as respective independent software entities, again separate
from the Unix kernel. These entities, or peer-level facilities,
are each provided with an interface that supports direct
communication between one another. This interface, or
messaging kernel layer, includes a message passing, multi-tasking
kernel. The messaging kernel's layers are tailored to
each type of peer-level facility in order to support the
specific facility's functions. The provision for multi-tasking
operation allows the peer-level facilities to manage multiple
concurrent processes. Messages are directed to other peer-level
facilities based upon the nature of the function
requested. Thus, for NFS file system requests, request messages
may be passed from an NFS network communications
peer-level facility directly to a UFS file system peer-level
facility and, as necessary, then to the mass storage peer-level
facility. The relevant data path is between the NFS network
communications peer-level facility and the mass storage
peer-level facility by way of the VME shared address space
primary memory. Consequently, the number of peer-level
facilities is not logically bounded and servicing of the most
common type of client workstation file system needs is
satisfied while requiring only a minimum amount of processing.

Finally, a Unix kernel, including its own NFS stack, UFS,
and mass storage device driver, is established as another
peer-level facility. As with the other peer-level facilities, this
operating system facility is provided with a multi-tasking
interface for interacting concurrently with the other peer-level
facilities as just another entity within the system 10.
While the operating system kernel peer-level facility is not
involved in the immediate servicing of most NFS requests,
it interacts with the NFS stack peer-level facility to perform
general management of the ARP and IP data bases, the initial
NFS file system access requests from a client workstation,
and to handle any non-NFS type requests that might be
received by the NFS stack peer-level facility.

II. Peer-level Processors
A. Network Control Processor

A block diagram of the preferred network control processor
is shown in FIG. 2. The network controller 12 includes
a 32-bit central processing unit (CPU) 30 coupled to a local
CPU bus 32 that includes address, control and data lines.
The CPU is preferably a Motorola 68020 processor. The data
line portion of the CPU bus 32 is 32 bits wide. All of the
elements coupled to the local bus 32 of the network controller
12 are memory mapped from the perspective of the
CPU 30. This is enabled by a buffer 34 that connects the 
local bus 32 to a boot PROM 38. The boot PROM 38 is 
utilized to store a boot program and its necessary start-up 
and operating parameters. Another buffer 40 allows the CPU
30 to separately address a pair of Ethernet local area network 
(LAN) controllers 42, 44, their local data packet memories 
46, 48, and their associated packet direct memory access 
(DMA) controllers 50, 52, via two parallel address, control, 
and 16-bit wide data buses 54, 56. The LAN controllers 42,
44 are programmed by the CPU 30 to utilize their respective 
local buffer memories 46, 48 for the storage and retrieval of 
data packets as transferred via the Ethernet connections 26. 
The DMA controllers 50, 52 are programmed by the CPU 30 
to transfer data packets between the buffer memories 46, 48
and a respective pair of multiplexing FIFOs 58, 60 also 
connected to the LAN buses 54, 56. The multiplexing FIFOs 
58, 60 each include a 16-bit to 32-bit wide data multiplexer/ 
demultiplexer, coupled to the data portion of the LAN buses 
54, 56, and a pair of internal FIFO buffers. Thus, for example
in the preferred embodiment of the present invention, a first 
32-bit wide internal FIFO is coupled through the multiplexer 
to the 16-bit wide LAN bus 54. The second internal FIFO, 
also 32-bit wide, is coupled to a secondary data bus 62. 
These internal FIFO buffers of the multiplexing FIFO 58, as
well as those of the multiplexing FIFO 60, may be swapped 
between their logical connections to the LAN buses, 54, 56 
and the secondary data bus 62. Thus, a large difference in the 
data transfer rate of the LAN buses 54, 56 and the secondary
data bus 62 can be maintained for a burst data length equal
to the depth of the internal FIFOs 58, 60. 
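
The width conversion performed by the multiplexing FIFOs 58, 60 can be illustrated with a minimal Python sketch. The function names and the ordering of the two 16-bit halves within each 32-bit word are assumptions made for illustration; the text above does not specify them.

```python
def pack_16_to_32(halfwords):
    """Combine consecutive pairs of 16-bit words into 32-bit words.

    Assumption: the first 16-bit word of each pair becomes the high half.
    """
    if len(halfwords) % 2 != 0:
        raise ValueError("an even number of 16-bit words is required")
    words = []
    for high, low in zip(halfwords[0::2], halfwords[1::2]):
        if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
            raise ValueError("each input must fit in 16 bits")
        words.append((high << 16) | low)
    return words


def unpack_32_to_16(words):
    """Split 32-bit words back into 16-bit words (inverse of pack_16_to_32)."""
    halfwords = []
    for word in words:
        halfwords.append((word >> 16) & 0xFFFF)
        halfwords.append(word & 0xFFFF)
    return halfwords
```

Packing halves the number of transfers needed on the wider bus, which is one way to read the rate difference between the 16-bit LAN buses and the 32-bit secondary data bus described above.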

A high speed DMA controller 64, controlled by the CPU 
30, is provided to direct the operation of the multiplexing 
FIFOs 58, 60 as well as an enhanced VME control logic 
block 66, through which the data provided on the secondary
data bus 62 is communicated to the data lines of the VME 
bus 22. The purpose of the multiplexing FIFOs 58, 60, 
besides acting as a 16-bit to 32-bit multiplexer and buffer, is 
to ultimately support the data transfer rate of the fast transfer 
mode of the enhanced VME control logic block 66.

Also connected to the local CPU data bus 32 is a quarter 
megabyte block of local shared memory 68, a buffer 70, and 
a third multiplexing FIFO 74. The memory 68 is shared in 
the sense that it also appears within the memory address 
space of the enhanced VME bus 22 by way of the enhanced
VME control logic block 66 and buffer 70. The buffer 70 
preferably provides a bidirectional data path for transferring 
data between the secondary data bus 62 and the local CPU 
bus 32 and also includes a status register array for receiving 
and storing status words either from the CPU 30 or from the
enhanced VME bus 22. The multiplexing FIFO 74, identical 
to the multiplexing FIFOs 58, 60, provides a higher speed, 
block-oriented data transfer capability for the CPU 30. 

Finally, a message descriptor FIFO 72 is connected 
between the secondary data bus 62 and the local CPU bus
32. Preferably, the message descriptor FIFO 72 is addressed 
from the enhanced VME bus 22 as a single shared memory 
location for the receipt of message descriptors. Preferably 
the message descriptor FIFO 72 is a 32-bit wide, single buffer
FIFO with a 256-word storage capability. In accordance with
the preferred embodiments of the present invention, the 
message descriptor FIFO is described in detail in the above- 
referenced related application "BUS LOCKING FIFO 
MULTI-PROCESSOR COMMUNICATIONS SYSTEM";
which application is hereby incorporated by reference.
However, for purposes of completeness, an enhancement 
embodied in the enhanced VME control logic block 66 is
that it preemptively allows writes to the message descriptor
FIFO 72 from the enhanced VME bus 22 unless the FIFO 72
is full. Where a write to the message descriptor FIFO 72
cannot be accepted, the enhanced VME control logic block 
66 immediately declines the write by issuing a VME bus 
error signal onto the enhanced VME bus. 
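
The preemptive-write behavior described above can be modeled with a small Python sketch. The class and exception names are hypothetical, and the exception merely stands in for the VME bus error signal; only the 32-bit width, the 256-word depth and the decline-when-full rule come from the text.

```python
from collections import deque


class BusError(Exception):
    """Stands in for the VME bus error signal (hypothetical name)."""


class MessageDescriptorFifo:
    """Hypothetical model of a single-buffer message descriptor FIFO."""

    def __init__(self, depth=256):
        self.depth = depth          # 256-word capacity, per the text
        self._words = deque()

    def write(self, descriptor):
        """Accept a 32-bit descriptor, or decline immediately when full."""
        if len(self._words) >= self.depth:
            raise BusError("message descriptor FIFO is full")
        self._words.append(descriptor & 0xFFFFFFFF)

    def read(self):
        """Return the oldest descriptor, or None when the FIFO is empty."""
        return self._words.popleft() if self._words else None
```

Because a full FIFO declines the write at once rather than stalling the bus, a sender in this model learns of the failure immediately and can retry later.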

B. File System Control Processor 

The preferred architecture of a file system processor 14
is shown in FIG. 3. A CPU 80, preferably a Motorola 68020 
processor, is connected via a local CPU address, control and 
32-bit wide data bus 82 to the various elements of the file 
controller 14. These principal elements include a 256 kilo-
byte static RAM block 84, used for storing the file
system control program, and a four megabyte dynamic 
RAM block 86 for storing local data, both connected directly 
to the local CPU bus 82. A buffer 88 couples the local CPU 
bus 82 to a secondary 32-bit wide data bus 90 that is, in turn, 
coupled through an enhanced VME control and logic block 
92 to the data bus lines of the VME bus 22. In addition to 
providing status register array storage, the buffer 88 allows 
the memory blocks 84, 86 to be accessible as local shared 
memory on the VME bus 22. A second buffer 94 is provided 
to logically position a boot PROM 96, containing the file 
controller initialization program, within the memory address 
map of the CPU 80. Finally, a single buffer message descrip- 
tor FIFO 98 is provided between the secondary data bus 90 
and the local CPU bus 82. The message descriptor FIFO 98 
is again provided to allow preemptive writes to the file 
controller 14 from the enhanced VME bus 22. 

C. Storage Control Processor 

A block diagram of a storage processor 16 is provided in 
FIG. 4. A CPU 100, preferably a Motorola 68020 processor,
is coupled through a local CPU address, control and 32-bit 
wide data bus 102 and a buffer 104 to obtain access to a boot
PROM 106 and a double-buffered multiplexing FIFO 108
that is, in turn, connected to an internal peripheral data bus
110. The internal peripheral data bus 110 is, in turn, coupled 
through a parallel channel array of double-buffered multi-
plexing FIFOs 112(1-10) and SCSI channel controllers 114(1-10).
The SCSI controllers 114(1-10) support the respective SCSI
buses (SCSI0-SCSI9) that connect to a drive array 24. 

Control over the operation of the double buffer FIFO
112(1-10) and SCSI controller 114(1-10) arrays is ultimately by
the CPU 100 via a memory-mapped buffer 116 and a first 
port of a dual ported SRAM command block 118. The 
second port of the SRAM block 118 is coupled to a DMA 
controller 120 that controls the low level transfer of data 
between the double-buffered FIFOs 108, 112(1-10), a tempo-
rary store buffer memory 122 and the enhanced VME bus
22. In accordance with a preferred embodiment of the 
present invention, the DMA controller responds to com- 
mands posted by the CPU 100 in the dual-ported SRAM 
block 118 to select any of the double-buffered FIFOs 108, 
112(1-10), the buffer memory 122, and the enhanced VME bus
22 as a source or destination of a data block transfer. To 
accomplish this, the DMA controller 120 is coupled through 
a control bus 124 to the double buffered FIFOs 108, 112(1-10),
the SCSI controllers 114(1-10), the buffer memory 122, a pair
of secondary data bus buffers 126, 128, and an enhanced 
VME control and logic block 132. The buffers 126, 128 are 
used to route data by selectively coupling the internal 
peripheral data bus 110 to a secondary data bus 130 and the 
buffer memory 122. The DMA controller 120, as imple- 
mented in accordance with a preferred embodiment of the 
present invention, is described in detail in the above- 
referenced related application "IMPROVED FAST TRANS-
FER DIRECT MEMORY ACCESS CONTROLLER";
which application is hereby incorporated by reference.
Finally, a one megabyte local shared memory block 134, a 
high speed buffer and register array 136, and a preemptive 
write message descriptor FIFO 138 are provided connected 
directly to the local CPU data bus 102. The buffer 136 is also
coupled to the secondary data bus 130, while the message 
descriptor FIFO 138 is coupled to the secondary data bus 
130. 

D. Primary Memory Array 

FIG. 5 provides a simplified block diagram of the pre-
ferred architecture of a memory card 18. Each memory card
18 operates as a slave on the enhanced VME bus and 
therefore requires no on-board CPU. Rather, a timing control 
block 150 is sufficient to provide the necessary slave control
operations. In particular, the timing control block 150, in
response to control signals from the control portion of the 
enhanced VME bus 22 enables a 32-bit wide buffer 152 for 
an appropriate direction transfer of 32-bit data between the 
enhanced VME bus 22 and a multiplexer unit 154. The 
multiplexer 154 provides a multiplexing and demultiplexing
function, depending on data transfer direction, for a six 
megabit by seventy-two bit word memory array 156. An 
error correction code (ECC) generation and testing unit 158 
is coupled to the multiplexer 154 to generate or verify, again 
depending on transfer direction, eight bits of ECC data per
memory array word. The status of each ECC verification 
operation is provided back to the timing control block 150. 

E. Host Processor 

The host processor 20, as shown in FIG. 1, is a conven-
tional Sun 3E120 processor. Due to the conventional design
of this product, a software emulation of a message descriptor
FIFO is performed in a reserved portion of the local host 
processor's shared memory space. This software message
descriptor FIFO is intended to provide the functionality of 
the message descriptor FIFOs 72, 98, and 138. A preferred
embodiment of the present invention includes a local host
processor 20', not shown, that includes a hardware preemp- 
tive write message descriptor FIFO, but that is otherwise 
functionally equivalent to the processor 20.

III. Peer-level Facility Architecture

A. Peer-Level Facility Functions

FIG. 6 provides an illustration of the multiple peer-level 
facility architecture of the present invention. However, only
single instantiations of the preferred set of the peer-level 
facilities are shown for purposes of clarity.

The peer-level facilities include the network communica-
tions facility (NC) 162, file system facility (FS) 164, storage
facility (S) 166 and host facility (H) 168. For completeness, 
the memory 18 is illustrated as a logical resource 18' and, 
similarly, the disk array 24 as a resource 24'.

The network communications facility 162 includes a 
messaging kernel layer 178 and an NFS stack. The messag- 
ing kernel layer 178 includes a multi-tasking kernel that 
supports multiple processes. Logically concurrent execu-
tions of the code making up the NFS stack are supported by
reference to the process context in which execution by the 
peer-level processor is performed. Each process is uniquely 
identified by a process ID (PID). Context execution switches
by the peer-level processor are controlled by a process 
scheduler embedded in the facility's multi-tasking kernel. A
process may be "active" — at a minimum, where process 
execution by the peer-level processor continues until a 
resource or condition required for continued execution is 
unavailable. A process is "blocked" when waiting for notice 
of availability of such resource or condition. For the network
communications facility 162, within the general context of 
the present invention, the primary source of process block-
ing is in the network and lower layers where a NC process
will wait, executing briefly upon receipt of each of a series 
of packet frames, until sufficient packet frames are received 
to be assembled into a complete datagram transferrable to a 
higher level layer. At the opposite extreme, a NC process 
will block upon requesting a file system or local host 
function to be performed, i.e., any function controlled or 
implemented by another peer-level facility. 

The messaging kernel layer 178, like all of the messaging 
kernel layers of the present invention, allocates processes to 
handle respective communication transactions. In allocating 
a process, the messaging kernel layer 178 transfers a pre- 
viously blocked process, from a queue of such processes, to 
a queue of active processes scheduled for execution by the 
multi-tasking kernel. At the conclusion of a communication 
transaction, a process is deallocated by returning the process 
to the queue of blocked processes. 
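
The allocation scheme described above, in which a process moves from a queue of blocked processes to a queue of active processes and back again, can be sketched as follows. This is an illustrative model only: the class name, method names and the use of plain integers as process IDs are assumptions, not details from the text.

```python
from collections import deque


class ProcessPool:
    """Hypothetical model of a messaging kernel layer's process queues."""

    def __init__(self, pids):
        self.blocked = deque(pids)   # idle processes waiting for a transaction
        self.active = deque()        # processes scheduled for execution

    def allocate(self):
        """Move a blocked process to the active queue; None if none is free."""
        if not self.blocked:
            return None
        pid = self.blocked.popleft()
        self.active.append(pid)
        return pid

    def deallocate(self, pid):
        """Return a process to the blocked queue when its transaction ends."""
        self.active.remove(pid)      # raises ValueError if pid is not active
        self.blocked.append(pid)
```

Keeping a pool of pre-created processes means that starting a transaction costs only a queue operation rather than the creation of a new process.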

As a new communication transaction is initiated, an 
address or process ID of an allocated process becomes the 
distinguishing datum by which the subsequent transactions 
are correlated to the relevant, i.e., proper handling, process. 
For example, where a client workstation initiates a new 
communication transaction, it provides its Ethernet address. 
The network communication facility will store and
subsequently, in responding to the request, utilize the cli-
ent's Ethernet address to direct the response back to the
specific requesting client.

The NC facility similarly provides a unique facility ID 
and the PID of its relevant process to another peer-level 
facility as part of any request necessary to complete a 
client's request. Thus, an NC facility process may block with 
certainty that the responding peer-level facility can direct its 
response back to the relevant process of the network com-
munications facility.

The network and lower level layers of the NFS stack
necessary to support the logical Ethernet connections 26' are 
generally illustrated together as an IP layer 172 and data link 
layer 170. The IP layer 172, coupled to the IP route database 
174, is used to initially distinguish between NFS and non-
NFS client requests. NFS requests are communicated to an
NFS server 176 that includes the remaining layers of the
NFS stack. The NFS server 176, in turn, communicates NFS 
requests to the network communications messaging kernel 
layer 178. By the nature of the call, the messaging kernel 
layer 178 is able to discern between NFS request calls, 
non-NFS calls from the IP layer 172 and network calls
received directly from the network layers 170. 

For the specific instance of NFS requests, making up the 
large majority of requests handled by the network commu- 
nications facility 162, the relevant NC process calls the 
messaging kernel layer 178 to issue a corresponding mes- 
sage to the messaging kernel layer 180 of the file system 
facility 164. The relevant NC process is blocked pending a 
reply message and, possibly, a data transfer. That is, when 
the messaging kernel layer 178 receives the NFS request 
call, a specific inter-facility message is prepared and passed 
to the messaging kernel layer 180 with sufficient information 
to identify the request and the facility that sourced the
request. As illustrated, messages are exchanged between the 
various messaging kernel layers of the system 160. 
However, the messages are in fact transferred physically via 
the enhanced VME bus connecting the peer-level processors
upon which the specific peer-level facilities are executing. 
The physical to logical relationship of peer-level facilities to 
peer-level processors is established upon the initialization of 
the system 160 by providing each of the messaging kernel 
layers with the relevant message descriptor FIFO addresses 
of the peer-level processors.

In response to a message received, the messaging kernel
layer 180 allocates a FS process within its multi-tasking 
environment to handle the communication transaction. This 
active FS process is used to call, carrying with it the received 
message contents, a local file system (LFS) server 182. This
LFS server 182 is, in essence, an unmodified instantiation 
184 of the UFS. Calls, in turn, issued by this UFS 182, 
ultimately intended for a device driver of a mass storage 
device, are directed back to the messaging kernel layer 180. 
The messaging kernel layer distinguishes such device driver
related functions being requested by the nature of the 
function call. The messaging kernel layer 180 blocks the 
relevant FS process while another inter-processor message is 
prepared and passed to a messaging kernel layer 186 of the 
storage facility 166. 

Since the storage facility 166 is also required to track
many requests at any one time, a single manager process is 
used to receive messages. For throughput efficiency, this
manager process responds to FIFO interrupts, indicating that
a corresponding message descriptor has just been written to
the SP FIFO, and immediately initiates the SP process
operation necessary to respond to the request. Thus, the
currently preferred S facility handles messages at interrupt 
time and not in the context of separately allocated processes. 
However, the messaging kernel layer 186 could alternately 
allocate an S worker process to service each received
message request. 

The message provided from the file system facility 164 
includes the necessary information to specify the particular 
function required of the storage facility in order to satisfy the 
request. Within the context of the allocated active S process,
the messaging kernel layer 186 calls the request correspond- 
ing function of a device driver 188. 

Depending on the availability and nature of the resource
requested, the device driver 188 will, for example, direct the
requested data to be retrieved from the disk array resource
24'. As data is returned via the device driver layer 188, the
relevant S process of the messaging kernel layer 186 directs
the transfer of the data into the memory resource 18'. 

In accordance with the preferred embodiments of the 
present invention, the substantial bulk of the memory
resource 18' is managed as an exclusive resource of the file 
system facility 164. Thus, for messages requesting the 
transfer of data to or from the disk array 24', the file system 
facility 164 provides an appropriate shared memory address 
referencing a suitably allocated portion of the memory
resource 18'. Thus, as data is retrieved from the disk array
24', the relevant S process of the messaging kernel layer 186
will direct the transfer of data from the device driver layer
188 to the message designated location within the memory 
resource 18', as illustrated by the data path 190.

Once the data transfer is complete, the relevant S process
"returns" to the messaging kernel layer 186 and a reply
message is prepared and issued by the messaging kernel 
layer 186 to the messaging kernel layer 180. The relevant S 
process may then be deallocated by the messaging kernel
layer 186. 

In response to this reply message, the messaging kernel 
layer 180 unblocks its relevant FS process, i.e., the process 
that requested the S facility data transfer. This, in turn,
results in the relevant FS process executing the UFS 182 and
eventually issuing a return to the messaging kernel layer 180 
indicating that the requested function has been completed. In 
response, the messaging kernel layer 180 prepares and 
issues a reply message on behalf of the relevant FS process 
to the messaging kernel layer 178; this message will include
the shared memory address of the requested data as stored 
within the memory resource 18'.

The messaging kernel layer 178 responds to the reply
message from the file system facility 164 by unblocking the 
relevant NC process. Within that NC process's context, the 
messaging kernel layer 178 performs a return to the NFS 
server 176 with the shared memory address. The messaging 
kernel layer 178 transfers the data from the memory 
resource 18' via the indicated data path 192 to local stored 
memory for use by the NFS server layer 176. The data may 
then be processed through the NFS server layer 176, IP layer 
172 and the network and lower layers 170 into packets for 
provision onto the network 26' and directed to the originally 
requesting client workstation. 
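
The read path traced above (NC facility to FS facility to S facility, with the data itself passing through the shared memory resource) can be summarized in a toy Python sketch. Everything here is an assumption made for illustration: the function names, the use of dictionaries for the disk array and memory resource, and the trivial address allocator. Direct function calls stand in for the inter-facility messages.

```python
def storage_read(disk_array, memory, block, address):
    """S facility: move a block from the disk array into shared memory."""
    memory[address] = disk_array[block]


def file_system_read(disk_array, memory, block):
    """FS facility: choose a shared memory address and ask S to fill it."""
    address = len(memory)            # toy allocator; an assumption
    storage_read(disk_array, memory, block, address)
    return address                   # the reply carries only the address


def network_read(disk_array, memory, block):
    """NC facility: request the data, then copy it out of shared memory."""
    address = file_system_read(disk_array, memory, block)
    return memory[address]
```

The point of the sketch is that the replies between facilities carry a shared memory address rather than the data, so the data is written to the memory resource once and read from it once.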

Similarly, where data is received via the network layer 
170 as part of an NFS write transfer, the data is buffered and 
processed through the NFS server layer 176. When 
complete, a call by the NFS server 176 to the messaging 
kernel layer 178 results in the first message of an inter-
facility communication transaction being issued to the file
system facility 164. The messaging kernel layer 180, on
allocating a FS process to handle the request message, replies
to the relevant NC process of the messaging kernel layer 178
with an inter-facility message containing a shared memory
address within the memory resource 18'. The NFS data is 
then transferred from local shared memory via the data path 
192 by the messaging kernel 178. When this data transfer is 
complete, another inter-facility message is passed to the 
relevant FS process of the messaging kernel layer 180. That
process is then unblocked and processes the data transfer 
request through the LFS/UFS 182. The UFS 182, in turn,
initiates, as needed, inter-facility communication transac- 
tions through the messaging kernel layer 180 to prepare for 
and ultimately transfer the data from the memory resource 
18' via the data path 190 and device driver 188 to the disk 
array resource 24'. 

The host operating system facility 168 is a substantially
complete implementation of the SunOS operating system
including a TCP/IP and NFS stack. A messaging kernel layer
194, not unlike the messaging kernel layers 178, 180, 186 is 
provided to logically integrate the host facility 168 into the
system 160. The operating system kernel portion of the
facility 168 includes the VFS 196 and a standard instantia- 
tion of the UFS 198. The UFS 198 is, in turn, coupled to a 
mass storage device driver 200 that, in normal operation, 
provides for the support of UFS 198 requests by calling the
messaging kernel layer 194 to issue inter-facility messages
to the storage facility 166. Thus, the storage facility 166 does
not functionally differentiate between the local host facility 
168 and the file system facility 164 except during the initial
phase of bootup. Rather, both generally appear as unique but
otherwise undifferentiated logical clients of the storage 
facility 166.

Also interfaced to the VFS 196 is a conventional client 
instantiation of an NFS layer 202. That is, the NFS layer 202 
is oriented as a client for processing client requests directed 
to another file server connected through a network commu-
nications facility. These requests are handled via a TCP/UDP
layer 204 of a largely conventional instantiation of the Sun 
NFS client stack. Connected to the layer 204 are the IP and
data link layers 206. The IP and data link layers 206 are
modified to communicate directly with the messaging kernel
layer 194. Messages from the messaging kernel layer 194, 
initiated in response to calls directly from the data link layer 
206 are logically directed by the messaging kernel 178 
directly to the data link layer 170 of a network communi-
cations facility. Similarly, calls from the IP layer 172,
recognized as not NFS requests of a local file system, are 
passed through the messaging kernel layers 178 and 194
directly to the TCP/UDP layers 204. In accordance with the
preferred embodiments of the present invention, the 
responses by the host facility 168 in such circumstances are
processed back through the entire host TCP/IP stack 214, 
204, 206, the messaging kernel layers 194, 178, and finally
the data link layer 170 of an NC facility 162.

Ancillary to the IP and data link layers 206, a route
database 208 is maintained under the control and direction 
of a conventional "routed" daemon application. This, and 
related daemons such as the "mountd", execute in the
application program layer as background processes. In order 
to maintain coherency between the route database 208 and 
the route database 174 present in the network communica-
tions facility 162, a system call layer 212, provided as the
interface between the application program layer and the 
kernel functions of the host facility 168, is modified in
accordance with the present invention. The modification 
provides for the issuance of a message containing any 
update information directed to the route database 208, from 
the daemons, to be provided by an inter-facility communi-
cation transaction from the messaging kernel layer 194 to
the messaging kernel layer 178. Upon receipt of such a 
message, the messaging kernel layer 178 directs an appro- 
priate update to the route database 174. 

The system call layer 212 also provides for access to the 
TCP/UDP layers via a conventional interface layer 214
known as sockets. Low level application programs may use 
the system call layer 212 to directly access the data storage 
system by calling directly on the device driver 200. The 
system call layer also interfaces with the VFS 196 for access 
to or by the NFS client 202 and the UFS 198.

In addition, as provided by the preferred embodiments of
the present invention, the VFS 196 also interfaces to a local
file system (LFS) client layer 216. The conventional VFS 
196 implements a "mount" model for handling the logical 
relation between and access to multiple file systems. By this
model a file system is mounted with respect to a specific file 
system layer that interfaces with the VFS 196. The file
system is assigned a file system ID (FSID). File operations 
subsequently requested of the VFS 196 with regard to a 
FSID identified file system will be directed to the appropri-
ate file system.

In accordance with the present invention, the LFS client
layer 216 is utilized in the logical mounting of file systems 
mounted through the file system facility 164. That is, the 
host facility's file oriented requests presented to the VFS 196
are routed, based on their FSID, through the LFS client layer 
216 to the messaging kernel layer 194, and, in turn, to the 
messaging kernel layer 180 of the file system facility 164 for 
servicing by the UFS 182. The model is extended for 
handling network file system requests. A client workstation
may then issue a mount request for a file system previously
exported through the VFS 196. The mount request is for-
warded by a network communications facility 162 ulti-
mately to a mountd daemon running in the application layer
210 of the host facility 168. The mountd daemon response
in turn provides the client with the FSID of the file system
if the export is successful. Thereafter, the client's NFS file
system requests received by the network communications 
facility 162 will be redirected, based on the FSID provided 
with the request, to the appropriate file system facility 164
that has mounted the requested file system. 
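
The FSID-based redirection described above amounts to a table lookup, which the following minimal Python sketch illustrates. The function name, the dictionary representation and the fallback to the host facility for an unrecognized FSID are all assumptions made for illustration.

```python
def route_request(fsid, mounted, host_facility):
    """Pick the facility that should service a file system request.

    mounted maps an FSID to the FS facility that mounted that file system.
    Falling back to the host facility for an unknown FSID is an assumption.
    """
    return mounted.get(fsid, host_facility)
```

For example, with `mounted = {17: "FS-0", 42: "FS-1"}`, a request carrying FSID 42 is routed to `"FS-1"` without consulting the host facility.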

Consequently, once a file system is mounted by the UFS
182 and exported via the network communications and host 
facilities 162, 168, file oriented NFS requests for that file
system need not be passed to or processed by the host
facility 168. Rather, such NFS requests are expediently 
routed directly to the appropriate file system facility 164.

The primary benefits of the present invention should now
be apparent. In addition to allowing multiple, independent 
instantiations of the network communication, file system, 
storage and host facilities 162, 164, 166, 168, the immediate 
requirements for all NFS requests may be serviced without
involving the substantial performance overhead of the VFS 
196 and higher level portions of the conventional Unix 
operating system kernel. 

Finally, another aspect of the host facility 168 is the
provision for direct access to the messaging kernel layer 194 
or via the system call layer 212 as appropriate, by mainte- 
nance application programs when executed within the appli- 
cation program layer 210. These maintenance programs may 
be utilized to collect performance data from status accumu- 
lation data structures maintained by the messaging kernel 
layer 194 and, by utilizing corresponding inter-facility 
messages, the accumulated status information from status 
data structures in the messaging kernel layers 178, 180 and 
186. 

B. Messaging Kernel Layer Functions 

The messaging kernel layers 178, 180, 186 and 194 each 
include a small, efficient multi-tasking kernel. As such, it 
provides only fundamental operating system kernel services. 
These services include simple lightweight process 
scheduling, message passing and memory allocation. A
library of standard functions and processes provide services 
such as sleep( ), wakeup( ), error logging, and real time 
clocks in a manner substantially similar to those functions of 
a conventional Unix kernel. 

The list below summarizes the primary function primi- 
tives of the multi-tasking kernel provided in each of the 
messaging kernel layers 178, 180, 186 and 194. 



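k_resolve(name)      Returns the process ID for a [remainder illegible in source]
k_send(msg,pid)      Sends a message to a specified [remainder illegible in source]
k_reply(msg)         [description illegible in source]
k_null_reply(msg)    [illegible in source] k_reply(msg) be[...] message need no[...] back.)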

The balance of the messaging kernel layers 178, 180, 186
and 194 is made up of routines that presumptively
implement, at least from the perspective of the balance of the
facility, the functions that a given facility might request of
another. These routines are premised on the function primi-
tives provided by the multi-tasking kernel to provide the
specific interface functions necessary to support the NFS
[illegible in source] operating system.

[illegible in source] the functions for [illegible in source]
referred to as "stub [illegible in source]

[Passage illegible in source; surviving fragments: "(IK)
Systi", "the peer-level", "messaging kernel", "perform
diagnostics. A single".]

be suspended, i.e., the reply message held, while the
[...]ing messaging kernel layer initiates a separate communica-
tion transaction with another peer-level facility. Once the
reply message of the second transaction is received, a
proper reply to the initial communication transaction can
then be made.

1. Message Descriptors and Messages 

The transfer of a message between sending and receiving
messaging kernel layers is, in turn, generally a two step
process. The first step is for the sending messaging kernel
layer to write a message descriptor to the receiving messaging
kernel layer. This is accomplished by the message
descriptor being written to the descriptor FIFO of the
receiving peer-level processor.

The second step is for the message, as identified by the
message descriptor, to be copied, either actually or
implicitly, from the sending messaging kernel layer to the
receiving messaging kernel layer. This copy, when actually
performed, is a memory to memory copy from the shared
memory space of the sending peer-level processor to that of
the receiving peer-level processor. Depending on the nature
of the communication transaction, the message copy will be
actually performed by the sending or receiving peer-level
processor, or implicitly by reference to the image of the
original message kept by the messaging kernel layer that
initiated a particular communication transaction.

The message identified by a message descriptor is evaluated
by the receiving messaging kernel layer to determine
what is to be done with the message. A message descriptor
as used by a preferred embodiment of the present invention
is shown in FIG. 7. The message descriptor is, in essence, a
single 32-bit word partitioned into two fields. The least
significant 2-bit field is used to store a descriptor modifier, while
the high order 30-bit field provides a shared memory address
to a message. The preferred values of the modifier field are
given in Table 1.

TABLE 1 
identifier. The text of the message then follows with any
necessary fill to reach a current maximum text limit. In the
preferred embodiment of the present invention, the text
length is 84 bytes. An inter-facility communication (IFC)
control data block is provided, again followed by any
necessary fill characters needed to complete the 128-byte
long message. This IFC control data preferably includes a
copy of the address of the original message, the relevant
sending and receiving (destination) process identifiers associated
with the current message, and any queue links
required to manage the structure while in memory.

TABLE 2

typedef struct {
    K_MSGTYPE type;            /* request code */
    char      msg[84];
    vme_t     addr;            /* shared memory address of
                                  the original message */
    PID       m16_sender_pid;  /* PID of last sender. */
    PID       m16_forward_pid; /* PID of last forwarder. */
    PID       m16_dest_pid;    /* PID of dest. process. */
    /* Following value is LOCAL and need
       not be transferred. */
    [...]     m16_link;        /* [...] link */
} K_MSG;



This structure (K_MSG) includes the message type field
(K_MSGTYPE), the message text (msg[ ]), and the IFC
block (addr, m16_sender_pid, m16_forward_pid,
m16_dest_pid, and m16_link). This K_MSG structure is used to
encapsulate specific messages, such as exemplified by a file
system facility message structure (FS_STD_T) shown in
Table 3.



TABLE 3 



Exemplary Specific Message Structure



nowledgmj 



{READ,WRITE,EXEC} fc 



For request messages that are being sent, the receiving 
messaging kernel layer performs the message copy. For a 
message that is a reply to a prior message, the sending 
messaging kernel layer is effectively told whether a message 
copy must be performed. That is, where the contents of a 
message have not been changed by the receiving messaging 
kernel layer, an implicit copy may be performed by replying
with a message descriptor that points to the original
message image within the sending messaging kernel layer's
local shared memory space. Similarly, for forwarding type
communication transactions, the receiving messaging kernel
layer performs the copy. A message forwarding transaction
is completed when an acknowledgement message is provided.
The purpose of the acknowledgement is to notify the
sending messaging kernel layer that it can return the
reference message buffer to its free buffer pool.

The preferred block format of a message is illustrated in 
FIG. 8. The message is a single data structure defined to 
occupy 128 bytes. The initial 32-bit word of the message 
encodes the message type and a unique peer-level facility 






The FS_STD_T structure is overlaid onto a K_MSG
structure with byte zero of both structures aligned. This
composite message structure is created as part of the formatting
of a message prior to being sent. Other message
structures, appropriate for particular message circumstances,
may be used. However, all are consistent with the use of the
K_MSG message and block format described above.
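
The 128-byte layout and the byte-zero overlay described above can be sketched in C as follows. This is an illustrative sketch only: the 32-bit widths chosen for K_MSGTYPE, vme_t and PID, the width of the link field, the explicit fill array, and the EXAMPLE_FS_MSG structure with its err and file_id members are all assumptions made for the example and are not taken from the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed 32-bit widths; the source does not give these typedefs. */
typedef uint32_t K_MSGTYPE;
typedef uint32_t vme_t;
typedef uint32_t PID;

#define K_MSG_SIZE 128   /* total message size stated in the text */
#define K_MSG_TEXT 84    /* message text length stated in the text */

typedef struct {
    K_MSGTYPE type;               /* request code */
    char      msg[K_MSG_TEXT];    /* message text */
    vme_t     addr;               /* address of the original message */
    PID       m16_sender_pid;     /* PID of last sender */
    PID       m16_forward_pid;    /* PID of last forwarder */
    PID       m16_dest_pid;       /* PID of destination process */
    uint32_t  m16_link;           /* local queue link (assumed width) */
    uint8_t   fill[K_MSG_SIZE - 4 - K_MSG_TEXT - 5 * 4]; /* pad to 128 */
} K_MSG;

/* Hypothetical specific message; stands in for a structure like FS_STD_T. */
typedef struct {
    K_MSGTYPE type;      /* shares byte zero with K_MSG.type */
    uint32_t  err;       /* hypothetical status field */
    uint32_t  file_id;   /* hypothetical request argument */
} EXAMPLE_FS_MSG;

static_assert(sizeof(K_MSG) == K_MSG_SIZE, "message must occupy 128 bytes");
static_assert(offsetof(K_MSG, type) == 0, "type field sits at byte zero");
static_assert(sizeof(EXAMPLE_FS_MSG) <= offsetof(K_MSG, addr),
              "specific message must not overlap the IFC block");

/* Format a composite message: overlay the specific structure at byte
 * zero of the K_MSG, then fill in the IFC control fields. */
static void format_example(K_MSG *km, K_MSGTYPE type, uint32_t file_id,
                           PID sender, PID dest)
{
    EXAMPLE_FS_MSG body = { type, 0, file_id };
    memset(km, 0, sizeof *km);
    memcpy(km, &body, sizeof body);   /* overlay, byte zero aligned */
    km->m16_sender_pid = sender;
    km->m16_dest_pid = dest;
}
```

The compile-time checks capture the two constraints stated in the text: the message is exactly 128 bytes, and any specific structure fits within the type and text region without touching the IFC control block.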
2. IFC Message Generation

Whether to send a message, and the nature of
the message, are determined by the peer-level facilities. In
particular, when a process executing on a peer-level processor
requires the support of another peer-level facility, such as
to store or retrieve data or to handle some condition that it
alone cannot service, the peer-level facility issues a message
requesting the required function or support. This message, in
accordance with the present invention, is generally initiated
in response to the same function call that the facility would
make in a uniprocessor configuration of the prior art. That is,
in a conventional single processor software system, execution
of a desired function may be achieved by calling an
appropriate routine, that, in turn, determines and calls its
own service routines. This is illustrated in FIG. 9. A function
call to a routine A, illustrated by the arrow 300, may select
and call 302 a routine B. As may be necessary to carry out
its function, the routine B may call 304 still further routines.
Ultimately, any functions called by the routine B return to
the function B which returns to the function A. The function
A then itself returns with the requested function call having
been completed.

In accordance with the present invention, the various
messaging kernel layers have been provided to allow the
independent peer-level facilities to be executed on respective
processors. This is generally illustrated in FIG. 10 by the
inclusion of the functions A' and B' representing the messaging
kernel layers of two peer-level facilities. A function
call 302 from the routine A is made to the messaging kernel
A'. Although A' does not implement the specific function
called, a stub routine is provided to allow the messaging
kernel layer A' to implicitly identify the function requested by
the routine A and to receive any associated function call
data; the data being needed by the routine B to actually carry
out the requested function. The messaging kernel layer A'
prepares a message containing the call data and sends a
message descriptor 306 to the appropriate messaging kernel
layer B'. Assuming that the message is initiating a new
communication transaction, the messaging kernel layer B'
copies the message to its own shared memory.

Based on the message type, the messaging kernel B'
identifies the specific function routine B that needs to be
called. Utilizing one of its own stub routines, a call containing
the data transferred by the message is then made to
the routine B. When routine B returns to the stub process
from which it was called, the messaging kernel layer B' will
prepare an appropriate reply message to the messaging
kernel layer A'. The routine B return may reference data,
such as the status of the returning function, that must also be
transferred to the messaging kernel layer A'. This data is
copied into the message before the message is copied back
to the shared memory space of the A' peer-level processor.
The message copy is made to the shared memory location
where the original message was stored on the A' peer-level
processor. Thus, the image of the original message is logically
updated, yet without requiring interaction between the
two messaging kernel layers to identify a destination storage
location for the reply message. A "reply" message descriptor
pointing to the message is then sent to the messaging kernel
layer A'.

The messaging kernel layer A', upon successive evaluation
of the message descriptor and the message type field of
the message, is able to identify the particular process that
resulted in the reply message now received. That is, the
process ID as provided in the original message sent and now
returned in the reply message, is read. The messaging kernel
layer A' is therefore able to return with any applicable reply
message data to the calling routine A in the relevant process
context.

A more robust illustration of the relation between two
messaging kernel layers is provided in FIG. 11. A first
messaging kernel layer 310 may, for example, represent the
messaging kernel layer 178 of the network communications
peer-level facility 162. In such case, the series of stub
routines A1-X include a complete NFS stack interface as
well as an interface to every other function of the network
communications facility that either can directly call or be
called by the messaging kernel layer 178. Consequently,
each call to the messaging kernel layer is uniquely
identifiable, both in type of function requested as well as the
context of the process that makes the call. Where the
messaging kernel layer calls a function implemented by the
NFS stack of its network communications facility, a process
is allocated to allow the call to operate in a unique context.
Thus, the call to or by a stub routine is identifiable by the
process ID, PID, of the calling or responding process,
respectively.

The calling process to any of the stub routines A1-X, upon
making the call, begins executing in the messaging kernel
layer. This execution services the call by receiving the
function call data and preparing a corresponding message.
This is shown, for purposes of illustrating the logical
process, as handled by the logical call format bubbles A1-X.
A message buffer is allocated and attached to a message
queue. Depending on the particular stub routine called, the
contents of the message may contain different data defined
by different specific message data structures. That is, each
message is formatted by the appropriate call format bubble
A1-X, using the function call data and the PID of the calling
process.

The message is then logically passed to an A message
state machine for sending. The A message state machine
initiates a message transfer by first issuing a message
descriptor identifying the location of the message and
indicating, for example, that it is a new message being sent.
The destination of the message descriptor is the shared
memory address of the message descriptor FIFO as present
on the intended destination peer-level processor. The specific
message descriptor FIFO is effectively selected based
on the stub routine called and the data provided with the call.
That is, for example, the messaging kernel layer 178 correlates
the FSID provided with the call to the particular file
system facility 164 that has mounted that particular file
system. If the messaging kernel layer 178 is unable to
correlate a FSID with a file system facility 164, as a
consequence of a failure to export or mount the file system,
the NFS request is returned to the client with an error.

Once the message descriptor is passed to the messaging
kernel layer 312 of an appropriate peer-level facility, the
multi-tasking kernel of the messaging kernel layer 310
blocks the sending process until a reply message has been
received. Meanwhile, the multi-tasking kernel of the layer 310
continues to handle incoming messages, initiated by
reading message descriptors from its descriptor FIFO, and
requests for messages to be sent based on calls received
through the stub routines A1-X.

The messaging kernel layer 312 is similar to the messaging
kernel layer 310, though the implementation of the layer
specifically with regard to its call format, return format, and
stub routines B1-X differ from their A layer counterparts.
Where, for example, the messaging kernel layer 312 is the
messaging kernel layer 180 of the file system facility 164,
the stub routines B1-X match the functions of the UFS 182
and device driver 188 that may be directly called in response
to a message from another facility or that may receive a
function call intended for another facility. Accordingly, the
preparation and handling of messages, as represented by the
B message parser, call format and return format bubbles,
will be tailored to the file system facility. Beyond this
difference, the messaging kernel layers 310, 312 are identical.

The B message state machine implemented by the multi-tasking
kernel of the messaging kernel layer 312 receives a
message descriptor as a consequence of the peer-level processor
reading the message descriptor from its message
descriptor FIFO. Where the message descriptor is initiating
a new message transaction, i.e., the message modifier is zero
or two, the B message state machine undertakes to copy the
message pointed to by the message descriptor into a newly
allocated message buffer in the local shared memory of its
peer-level processor. If the message modifier indicates that
the message is a reply to an existing message transaction,
then the B message state machine assumes that the message
has already been copied to the previously allocated buffer
identified by the message descriptor. Finally, if the message
descriptor modifier indicates that the message pointed to by
the message descriptor is to be freed, the B message state
machine returns it to the B multi-tasking kernel's free message
buffer pool.
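
The handling of a received descriptor can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the implementation of the preferred embodiment: the function and enumerator names are hypothetical; the address is assumed to be 4-byte aligned so that it can share a 32-bit word with the 2-bit modifier; and the assignment of the value one to "reply" and three to "free" is inferred from the example transactions in this description rather than taken from Table 1.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t msg_desc_t;   /* one 32-bit descriptor word */

/* Modifier values. 0 and 2 start a new transaction per the text;
 * 1 (reply) and 3 (free) are inferred from the example transactions. */
enum md_modifier {
    MD_NEW     = 0,
    MD_REPLY   = 1,
    MD_NEW_ALT = 2,
    MD_FREE    = 3
};

/* Actions the receiving state machine takes for a descriptor. */
enum md_action {
    ACT_COPY_TO_NEW_BUFFER,    /* copy message into a new local buffer */
    ACT_USE_EXISTING_BUFFER,   /* reply already sits in original buffer */
    ACT_RETURN_TO_FREE_POOL    /* release the referenced buffer */
};

/* Pack a 4-byte-aligned shared memory address with a 2-bit modifier. */
static msg_desc_t md_pack(uint32_t addr, unsigned mod)
{
    assert((addr & 3u) == 0);          /* low two bits carry the modifier */
    return (addr & ~UINT32_C(3)) | (mod & 3u);
}

static uint32_t md_addr(msg_desc_t d) { return d & ~UINT32_C(3); }
static unsigned md_mod(msg_desc_t d)  { return (unsigned)(d & 3u); }

/* Decide what to do with a descriptor read from the descriptor FIFO. */
static enum md_action md_dispatch(msg_desc_t d)
{
    switch (md_mod(d)) {
    case MD_NEW:
    case MD_NEW_ALT:
        return ACT_COPY_TO_NEW_BUFFER;
    case MD_REPLY:
        return ACT_USE_EXISTING_BUFFER;
    default:
        return ACT_RETURN_TO_FREE_POOL;
    }
}
```

Because both the address and the modifier travel in one word, a single FIFO write is enough to tell the receiver where the message is and which of the three paths applies.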

Received messages are initially examined to determine
their message type. This step is illustrated by the B message
parser bubble. Based on message type, a corresponding data
structure is selected by which the message can be properly
read. The process ID of the relevant servicing destination
process is also read from the message and a context switch
is made. The detailed reading of the message is illustrated as
a series of return format bubbles B1-X. Upon reading the
message, the messaging kernel layer 312 selects a stub
routine appropriate to carry out the function requested by
the received message and performs a function call through
the stub routine. Also, in making the function call, the data
contained by the message is formatted as appropriate for
transfer to the called routine.
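
The parse-and-call step above can be pictured as a table lookup keyed on message type. The following C sketch is illustrative only: the numeric type codes, the ex_msg structure, the two stub functions and the ex_current_pid variable (a stand-in for the context switch) are all hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical numeric codes for two message types. */
enum { EX_FC_READ = 1, EX_FC_WRITE = 2 };

/* Hypothetical, simplified received message. */
struct ex_msg {
    unsigned type;       /* message type */
    unsigned dest_pid;   /* servicing destination process */
    int      arg;        /* call data carried by the message */
};

typedef int (*ex_stub_fn)(const struct ex_msg *m);

/* Stand-ins for stub routines that would call the real functions. */
static int ex_stub_read(const struct ex_msg *m)  { return m->arg + 1000; }
static int ex_stub_write(const struct ex_msg *m) { return m->arg + 2000; }

static const struct {
    unsigned   type;
    ex_stub_fn stub;
} ex_stub_table[] = {
    { EX_FC_READ,  ex_stub_read  },
    { EX_FC_WRITE, ex_stub_write },
};

static unsigned ex_current_pid;   /* stand-in for a context switch */

/* Select a stub by message type, switch context, and make the call.
 * Returns 0 on success, -1 if no stub matches the message type. */
static int ex_parse_and_call(const struct ex_msg *m, int *result)
{
    assert(m != NULL && result != NULL);
    for (size_t i = 0; i < sizeof ex_stub_table / sizeof ex_stub_table[0]; i++) {
        if (ex_stub_table[i].type == m->type) {
            ex_current_pid = m->dest_pid;        /* run in destination context */
            *result = ex_stub_table[i].stub(m);  /* call through the stub */
            return 0;
        }
    }
    return -1;
}
```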

3. IFC Communication Transactions

FIG. 12 illustrates an exemplary series of communication
transactions that are used for a network communications
facility or a local host facility to obtain known data from the
disk array 24 of the present invention. Similar series of
communication transactions are used to read directory and
other disk management data from the disk array. For clarity,
the transfer of messages is referenced to time, though time
is not to scale. Also for purposes of clarity, a pseudo-representation
of the message structures is referenced in
describing the various aspects of preparing messages.
a. LFS Read Transaction

At a time t[...], an NFS read request is received by the
messaging kernel layer 178 of the network communications
facility 162 from an executing (sending) process
(PID=A$$). Alternately, the read request at t[...] could be from a host
process issuing an equivalent LFS read request. In either
case, a corresponding LFS message (message#1) is prepared
(message#1.msg_type=fc_read; message#1.sender_pid=A$$;
message#1.dest_pid=B$$).

The destination process (PID=B$$) is known to the
messaging kernel layer 178 or 194 as the "manager" process
of the file system facility that has mounted the file system
identified by the FSID provided with the read request. The
association of an FSID with a particular FS facility's PID is
a product of the initialization of all of the messaging kernel

In general, at least one "manager" process is created
during initialization of each messaging kernel layer. These
"manager" processes, directly or indirectly, register with a
"name server manager" process (SC_NAME_SERVER)
running on the host facility. Subsequently, other "manager"
processes can query the "name server manager" to obtain the
PID of another "manager" process. For indirect relations, the
supervising "manager" process, itself registered with the
"name server manager" process, can be queried for the PIDs
of the "manager" processes that it supervises.
For example, a single named "file system administrator"
(FC_VICE_PRES) process is utilized to supervise the
potentially multiple FS facilities in the system 160. The
FC_VICE_PRES process is registered directly with the
"name server manager" (SC_NAME_SERVER) process.
The "manager" processes of the respective FS facilities
register with the "file system administrator" (FC_VICE_PRES)
process, and thus are indirectly known to the "name
server manager" (SC_NAME_SERVER). The individual
FS "manager" processes register with the given FSIDs of
their mounted file systems. Thus, the "name server manager"
(SC_NAME_SERVER) can be queried by an NC
facility for the PID of the named "file system administrator"
(FC_VICE_PRES). The NC facility can then query for the
PID of the unnamed "manager" process that controls access
to the file system identified by a FSID.
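
The two-level lookup just described can be sketched in C as follows. The sketch is an assumption-laden illustration: in the described system each query is a message exchanged between processes, whereas here the "name server manager" and the "file system administrator" are modeled as in-memory tables, and the PID values, FSID values and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned PID;
#define PID_NONE 0u

struct name_entry { const char *name; PID pid; };
struct fsid_entry { unsigned fsid; PID manager_pid; };

/* Registrations held by the "name server manager" (example values). */
static const struct name_entry name_server[] = {
    { "FC_VICE_PRES", 100u },
};

/* Registrations held by the "file system administrator" (example values). */
static const struct fsid_entry fs_admin[] = {
    { 7u, 201u },
    { 9u, 202u },
};

/* First level: resolve a named supervising process to its PID. */
static PID resolve_name(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof name_server / sizeof name_server[0]; i++)
        if (strcmp(name_server[i].name, name) == 0)
            return name_server[i].pid;
    return PID_NONE;
}

/* Second level: ask the administrator for the manager of an FSID. */
static PID resolve_fsid(PID admin, unsigned fsid)
{
    if (admin != 100u)              /* only the example administrator exists */
        return PID_NONE;
    for (size_t i = 0; i < sizeof fs_admin / sizeof fs_admin[0]; i++)
        if (fs_admin[i].fsid == fsid)
            return fs_admin[i].manager_pid;
    return PID_NONE;
}

/* Combined lookup, as an NC facility would perform it. */
static PID find_fs_manager(unsigned fsid)
{
    PID admin = resolve_name("FC_VICE_PRES");
    if (admin == PID_NONE)
        return PID_NONE;
    return resolve_fsid(admin, fsid);
}
```

A lookup that returns PID_NONE corresponds to the case in which no FS facility has mounted the file system with the given FSID.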

The function of a non-supervising "manager" process is to
be the known destination of a message. Thus, such a
"manager" process initially handles the messages received
[...]
an appropriate local worker process for handling.
Consequently, the various facilities need know only the PID
of the "manager" process of another facility, not the PID of
the worker process, in order to send a request message.

At t3, a corresponding message descriptor (md#1vme_addr;
mod=0), shown as a dashed arrow, is sent to the FS's
messaging kernel layer 180.

At t4, the FS messaging kernel layer 180 copies down the
message (message#1), shown as a solid arrow, for
evaluation, allocates a worker process to handle the request
and, in the context of the worker process, calls the requested
function of its UFS 182. If the required data is already
present in the memory resource 18', [...]
transaction with the S messaging kernel layer [...]
required, and the FS messaging kernel layer [...]
immediately at t[...]. However, if a disk read is required, the
messaging kernel layer 180 is directed by the UFS 182 to
initiate another communications transaction to request
retrieval of the data by the storage facility 166. That is, the
UFS 182 calls a storage device driver stub routine of the
messaging kernel layer 180. A message (message#2), including
a vector address referencing a buffer location in the
memory resource 18' (message#2.msg_type=sp_read;
message#2.vme_addr=xxxxh; message#2.sender_pid=B$$;
message#2.dest_pid=C$$), is prepared. At t[...], a corresponding
message descriptor is sent (md#2vme_addr;
mod=0) to the S messaging kernel layer 186.

At t[...], the S messaging kernel layer 186 copies down the
message (message#2) for evaluation, allocates a worker
process to handle the request and calls the requested function
of its device driver 188 in the context of the worker
process. Between t[...] and t[...], the requested data is transferred
to the message specified location (message#2.vme_addr=xxxxh)
in the memory resource 18'. When complete, the
device driver returns to the calling stub routine of the S
messaging kernel layer 186 with, for example, the successful
(err=0) or unsuccessful (err=-1) status of the data transfer.
Where there is an error, the message is updated
(message#2.err=-1) and, at t[...], copied up to the messaging
kernel layer 180 (md#2vme_addr). A reply message
descriptor (md#2vme_addr; mod=1) is then sent at t[...] to the
FC messaging kernel layer 180. However, where there is no
error, a k_null_reply(msg) is used. This results in no copy
of the unmodified message at t[...], but rather just the sending
of the reply message descriptor (md#2vme_addr; mod=1) at
t[...].
Upon processing the message descriptor and reply message
(message#2), the FS messaging kernel layer 180
unblocks and returns to the calling process of the UFS 182
(message#2.sender_pid=B$$). After completing any processing
that may be required, including any additional
communication transactions with the storage facility that
might be required to support or complete the data transfer,
the UFS 182 returns to the stub routine that earlier called the
UFS 182. The message is updated with status and the data
location in the memory resource 18' (message#1.err=0;
message#2.vme_addr=xxxxh=message#1.vme_addr=xxxxh)
and, at t[...], copied up to the messaging kernel layer
178 or 194 (md#1vme_addr). A reply message descriptor
(md#1vme_addr; mod=1) is then sent at t[...] to the messaging
kernel layer of the NC or local host, as appropriate.

The messaging kernel layer 178 or 196 processes the
reply message descriptor and associated message. As indicated
between t[...] and t[...], the messaging kernel layer 178 or
196, in the context of the requesting process (PID=A$$), is
responsible for copying the requested data from the memory
resource 18' into its peer-level processor's local shared
memory. Once completed, the messaging kernel layer 178 or
196 prepares a final message (message#3) to conclude its
series of communication transactions with the FS messaging
kernel layer 180. This message is the same as the first
message (message#3=message#1), though updated by the
FS facility as to message type (message#3.msg_type=fc_read_release)
to notify the FC facility that it no longer
requires the requested data space (message#3.vme_addr=xxxxh)
to be held. In this manner, the FC facility can
maintain its expedient, centralized control over the memory
resource 18'. A corresponding message descriptor
(md#3vme_addr=md#1vme_addr; mod=0) is sent at t[...].

At t[...], the release message (message#3) is copied down
by the FC messaging kernel layer 180, and the appropriate
disk buffer management function of the UFS 182 is called,
within the context of a worker process of the relevant
manager process (message#3.dest_pid=B$$), to release the
buffer memory (message#3.vme_addr=xxxxh). Upon
completion of the UFS memory management routine, the
relevant worker process returns to the stub routine of the FS
messaging kernel layer 180. The worker process and the
message (message#3) are deallocated with respect to the FS
facility and a reply message descriptor (md#3vme_addr;
mod=1) is returned to the messaging kernel layer 178 or 196,
whichever is appropriate.

Finally, at t[...], the messaging kernel layer 178 or 196
returns, within the context of the relevant process
(PID=A$$), to its calling routine. With this return, the address of
the retrieved data within the local shared memory is provided.
Thus, the relevant process is able to immediately
access the data as it requires.

b. LFS Write Transaction 

FIG. 13 illustrates an exemplary series of communication
transactions used to implement an LFS write to disk.

Beginning at a time t[...], an LFS write request is received by
the messaging kernel layer 178 of the network communications
facility 162 from an executing process (PID=A$$) in
response to an NFS write request. Alternately, the LFS write
request at t[...] could be from a host process. In either case, a
corresponding message (message#1) is prepared
(message#1.msg_type=fc_write; message#1.sender_pid=A$$;
message#1.dest_pid=B$$) and, at t[...], its message
descriptor (md#1vme_addr; mod=0) is sent to the FC
messaging kernel layer 180.

At t[...], the FC messaging kernel layer 180 copies down the
message (message#1) for evaluation, allocates a worker
process to handle the request by the manager process
(PID=B$$), which calls the requested function of its UFS
182. This UFS function allocates a disk buffer in the memory
resource 18' and returns a vector address (vme_addr=xxxxh)
referencing the buffer to the FC messaging kernel
layer 180. The message is again updated (message#2.vme_addr=xxxxh)
and copied back to the messaging kernel layer
178 or 194 (md#1vme_addr). A reply message descriptor
(md#1vme_addr; mod=1) is then sent back to the messaging
kernel layer 178 or 194, at t[...].

Between t[...] and t[...], the relevant process (PID=A$$) of the
NC or host facility copies data to the memory resource 18'.
When completed, the messaging kernel layer 178 or 194 is
again called, at t[...], to complete the write request. A new
message (message#2=message#1) is prepared, though
updated with the amount of data transferred to the memory
resource 18' and message type (message#2.msg_type=fc_write_release),
thereby implying that the FS facility will
have control over the disposition of the data. Preferably, this
message utilizes the available message buffer of message#1,
thereby obviating the need to allocate a new message buffer
or to copy data from message#1. The message descriptor
(md#2vme_addr=md#1vme_addr; mod=0) for this message
is sent at t[...].

The message is copied down by the FC messaging kernel
layer 180 and provided to a worker process by the relevant
manager process (message#2.dest_pid=B$$). While a reply
message descriptor might be provided back to the messaging
kernel layer 178 or 194 immediately, at t[...], thereby releasing
the local shared memory buffer, the present invention adopts
the data coherency strategy of NFS by requiring the data to
be written to disk before acknowledgment. Thus, upon
copying down the message at t[...], the messaging kernel layer
180 calls the UFS 182 to write the data to the disk array 24'.
The UFS 182, within the context of the relevant worker
process, calls the messaging kernel layer 180 to initiate
another communication transaction to request a write out of
the data by the storage facility 166. Thus, a storage device
driver stub routine of the messaging kernel layer 180 is
called. A message (message#3), including the shared
memory address of a buffer location in the memory resource
18' (message#3.msg_type=sp_write; message#2.vme_addr=xxxxh;
message#2.sender_pid=B$$; message#2.dest_pid=C$$),
is prepared. At t[...], a corresponding message
descriptor is sent (md#3vme_addr; mod=0) to the S messaging
kernel layer 186.

At t[...], the S messaging kernel layer 186 copies down the
message (message#3) for evaluation, allocates a worker
process to handle the request by the manager process
(PID=C$$), which calls the requested function of its device
driver 188. Between t[...] and t22, the requested data is
transferred from the message specified location
(message#3.vme_addr=xxxxh) of the memory resource 18'.
When complete, the device driver returns to the calling stub
routine of the S messaging kernel layer 186 with, for
example, the status of the data transfer (err=0). The message
is updated (message#3.err=0) and, at t23, copied up to the
messaging kernel layer 180 (md#3vme_addr). A reply message
descriptor (md#3vme_addr; mod=1) is then sent at t[...]
to the FC messaging kernel layer 180.

Upon processing the message descriptor and reply message
(message#3), the FC messaging kernel layer 180
returns to the calling process of the UFS 182
(message#3.sender_pid=B$$). After completing any UFS
processing that may be required, including any additional
communication transactions with the storage facility that
might be required to support or complete the data transfer,
the UFS 182 returns to the messaging kernel layer 180. At
this point, the UFS 182 has completed its memory management
of the memory resource 18'. At t[...], the messaging
kernel layer 180 sends the reply message descriptor
(md#2vme_addr; mod=1) to the messaging kernel layer 178
or 196, as appropriate, to indicate that the data has been
transferred to the disk array resource 24'.

Finally, at t[...], the messaging kernel layer 178 or 196
returns, within the context of the relevant worker process, to
its calling routine.

c. NC/Local Host Transfer Transaction 



FIG. 14 illustrates the [...]
delivery of data, as provided from a NC facility process
(PID=A$$), to an application program executing in the
application program layer of the local host facility. The
packet, for example, could contain new routing information
to be added to the route data base. However, since the NC
facility does not perform any significant interpretation of
non-NFS packets beyond identification as an IP packet, the
packet is passed to the local host facility. The local host,
upon recognizing the nature of the non-NFS packet, will
pass it ultimately to the IP client, as identified by the packet,
for interpretation. In this example, the IP client would be the
"route" daemon.

Thus, the transaction begins at t2, with the NC messaging
kernel layer 178 writing a message descriptor (md#1.vme_addr;
mod=0) to the host messaging kernel layer 194. The
referenced message (message#1.msg_type=nc_recv_ip_pkt;
message#1.sender_pid=D$$; message#1.dest_pid=E$$)
is copied down, at t[...], by the host messaging kernel
layer 194. A reply message descriptor (md#1.vme_addr;
mod=3) is then returned to the NC messaging kernel layer
178 at t[...].

The packet is then passed, by the local host messaging
kernel layer 194, to the TCP/UDP layers 204 of the local
host facility for processing and, eventually, delivery to the
appropriate application program.

As shown at t[...], the application program may subsequently
call the host messaging kernel layer 194, either
directly or indirectly through the system call layer. This call
could be, for example, issued as a consequence of the
application program making a system call layer call to
update the host's IP route database. As described earlier, this
call has been modified to also call the host messaging kernel
layer 194 to send a message to the NC facility to similarly
update its IP route database. Thus, a message descriptor
(md#2.vme_addr; mod=0) is sent at t[...] to the NC messaging
kernel layer 178. The referenced message
(message#2.msg_type=nc_add_route;
message#2.sender_pid=E$$; message#2.dest_pid=D$$) is
copied up, at t[...], by the NC messaging kernel layer 178. The
NC messaging kernel layer 178 then calls the NC facility
function to update the IP route database. Finally, a reply
message descriptor (md#2.vme_addr; mod=1) is returned to
the local host messaging kernel layer 194 at t[...].

d. NC/NC Route Transfer Transaction

FIG. 15 illustrates the routing, or bridging, of a data
packet between two NC facility processes. The two NC processes
may be executing on separate peer-level processors, or exist
as two parallel processes executing within the same NC
facility. The packet, for example, is intercepted at the IP
layer within the context of the first process (PID=A$$). The
IP layer identifies the logical NC facility that the packet is
to be routed to and calls the messaging kernel layer 178 to
prepare an appropriate message (message#1). The data
packet itself is copied to a portion of the memory resource
18' (vme_addr=xxxxh) that is reserved for the specific NC
facility; this memory is not under the control of any FS
facility.

Thus, at t[...], the NC messaging kernel layer 178 writes a
message descriptor (md#1.vme_addr; mod=0) to the second
messaging kernel layer 178. The referenced message
(message#1.msg_type=nc_forward_ip_pkt;
message#1.sender_pid=F$$; message#1.dest_pid=G$$;
message#1.vme_addr=xxxxh; message#1.ethernet_dst_net=xx)
is copied down, at t[...], by the second NC messaging
kernel layer 178. The data packet is then copied, between t4
and t[...], from the memory resource 18' to the local shared
memory of the second NC peer-level processor.

Since the first NC facility must manage its portion of the
memory resource 18', the second NC messaging kernel layer
178, at tg, returns a reply message descriptor (md#1.vme_addr;
mod=1) back to the first NC messaging kernel layer
178 at tg. This notifies the first NC facility that it no longer
requires the memory resource 18' data space
(message#1.vme_addr=xxxxh) to be held. In this manner,
the first NC facility can maintain expedient, centralized
control over its portion of the memory resource 18'.

The packet data is then passed, by the second NC mes- 
saging kernel layer 178, to the IP layer of its NC facility for 



4. Detailed Communication Transaction Messages, 
Syntax, and Semantics 
A Notation for Communication Transactions 

A terse notation for use in describing communication
transactions has been developed. This notation does not
directly represent the code that implements the transactions,
but rather is utilized to describe them. An example and
explanation of the notation is made in reference to an LFS
type transaction requesting the attributes of a given file.

The communication transaction:

fc_get_attributes(FILE,ATTRIBUTES);
identifies a message of type FC_GET_ATTRIBUTES. The
expected format of the message, when
sent to the FS facility, for example, is a typedef FILE, and
when the message is returned, its format is a typedef
ATTRIBUTES.

A second convention makes it very clear when the FS 
facility, for example, returns the message in the same format 
that it was originally sent. The communication transaction: 

get_buffer(BUFFER,***);
describes a transaction in which the NC facility, for example, 
sends a typedef BUFFER, and that the message is returned 
using the same structure. 

If a facility can indicate success by returning the message
unchanged (k_null_reply( )), then the format is:

free_buffer(BUFFER,*);

Sometimes, when facilities use standard structures, only 
some of the fields will actually have meaning. The following 
notation identifies meaningful fields: 



This transaction notation describes the same transaction
as get_buffer above, but in more detail. The facility requests
a buffer of a particular length, and the responding facility 
returns a pointer to the buffer along with the buffer's actual 



a. FS Facility Communication Transactions 

The communication transactions that the FS facilities of
the present invention recognize, and that the messaging kernel
layers of the other facilities of the present invention recognize as
appropriate to interact with the FS facility, are summarized
in Table 4 below.

TABLE 4

Summary of FS Communication Transactions

LFS Configuration Management

fc_find_manager ( FC_MOUNT_T, ***{errno,fc_pid} )
fc_mount        ( FC_MOUNT_T, ***{errno,fc_pid,file} )
fc_unmount      ( FC_STD_T{partition.fsid}, *{errno} )

LFS Data Transfer Messages

fc_read         ( FC_RDWR_T{un.in}, ***{errno,un.out.{bd,vattr}} )
fc_write        ( FC_RDWR_T{un.in}, ***{errno,un.out.{bd,vattr}} )
fc_readdir      ( FC_RDWR_T{un.in}, ***{errno,un.out.{bd,new_offset}} )
fc_readlink     ( FC_RDWR_T{un.in.file,un.in.cred}, ***{errno,un.out.bd} )
fc_release      ( FC_RDWR_T{un.out.bd}, *{errno} )

LFS File Management Messages

fc_null         ( K_MSG, *** )
fc_null_null    ( K_MSG, * )
fc_getattr      ( FC_STD_T{cred,file,un.mask}, FC_FILE_T{errno,vattr} )
fc_setattr      ( FC_SATTR_T, FC_FILE_T{errno,vattr} )
fc_lookup       ( FC_DIROP_T{cred,where}, FC_FILE_T )
fc_create       ( FC_CREATE_T, FC_FILE_T )
fc_remove       ( FC_DIROP_T{cred,where}, *{errno} )
fc_rename       ( FC_RENAME_T, *{errno} )
fc_link         ( FC_LINK_T, *{errno} )
fc_symlink      ( FC_SYMLINK_T, *{errno} )
fc_rmdir        ( FC_DIROP_T{cred,where}, *{errno} )
fc_statfs       ( FC_STATFS_T{fsid}, *** )

VOP, VFS and Other Miscellaneous LFS Messages

fc_fsync        ( FC_STD_T{cred,file}, *{errno} )
fc_access       ( FC_STD_T{cred,file,mode}, *{errno} )
fc_syncfs       ( FC_STD_T{cred,fsid}, *{errno} )


The use of these communication transactions is
illustrated from the perspective of their use.

An FS facility process named FC_VICE_PRES directs
the configuration of all FS facilities in the system 160. Even
with multiple instantiations of the FS facility, there is only
one FC_VICE_PRES process. There are also one or more
unnamed manager processes which actually handle most
requests. Each file system (or disk partition) in the system
160 belongs to a particular manager; however, a manager
may own more than one file system. Since managers are
unnamed, would-be clients of a file system first check with
FC_VICE_PRES to get the FS facility pid of the appropriate
manager. Thus, the FC_VICE_PRES process does no

actual work. Rather, it simply operates to direct requests to
the appropriate manager.
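
The directory role of FC_VICE_PRES can be pictured with a small lookup sketch. This is only an illustration under stated assumptions: the `fs_owner_t` record, the `find_manager` helper, and every FSID and pid value below are hypothetical and are not taken from the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ownership record: which manager pid serves which file system. */
typedef struct {
    long fsid;        /* file system identifier */
    long manager_pid; /* FS facility pid of the owning manager */
} fs_owner_t;

/* Illustrative table only; the FSID and pid values are made up. */
static const fs_owner_t fs_owners[] = {
    { 0x0700, 101 },
    { 0x0701, 101 }, /* a manager may own more than one file system */
    { 0x0800, 102 },
};

/* Directory-style lookup in the spirit of fc_find_manager: no file system
 * work is done here, the caller is only told which manager to talk to.
 * Returns the manager pid, or -1 if no manager owns the file system. */
static long find_manager(long fsid)
{
    size_t n = sizeof fs_owners / sizeof fs_owners[0];
    for (size_t i = 0; i < n; i++) {
        if (fs_owners[i].fsid == fsid)
            return fs_owners[i].manager_pid;
    }
    return -1;
}
```

A client would call such a lookup once per file system and then address all later requests to the returned pid.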

To provide continuous service, managers must avoid 
blocking. Managers farm out requests that would block to a 
pool of unnamed file controller worker processes. These
details are not visible to FS facility clients. 

The significant message structures used by the FS facility
are given below. For clarity, the commonly used structures
are described here. An FSID (file system identifier) identifies
an individual file system. An FSID is simply the UNIX
device number for the disk array partition which the file
system lives on. An FC_FH structure (file controller file



handle) identifies individual files. It includes an FSID to 
identify which file system the file belongs to, along with an 
inode number and an inode generation to identify the file 



Start-up Mounting and Unmounting 

Once the FC peer-level processor has booted an instantiation
of the FS facility, the first FS facility to boot spawns
an FC_VICE_PRES process which, in turn, creates any
managers it requires, then waits for requests. Besides a few
"internal" requests to coordinate the mounting and unmounting
of file systems in the operation of multiple file system
facilities, the only request it accepts is:
fc_find_manager (FC_MOUNT_T,***{errno,fc_pid});

The input message includes nothing but an FSID identifying
the file system of interest. The successful return value is an
FS facility process id which identifies the manager responsible
for this file system. Having found the manager, a client
facility with the appropriate permissions can request that a
file system be made available for user requests (mount) or
unavailable for user requests (unmount). These requests are
made by the local host facility, through its VFS/LFS client
interface; requests for the mounting and unmounting of file
systems are not received directly from client NC facilities.
The transaction:

fc_mount (FC_MOUNT_T,***{errno,fc_pid,file});
returns the root file handle in the requested file system.
The unmount transaction:

fc_unmount (FC_STD_T{fsid},*{errno});
returns an error code. (The * in the transaction description
indicates that a k_null_reply( ) is possible, thus the caller
must set errno to zero to detect a successful reply.)
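
The reason the caller must clear errno can be shown with a short sketch. It assumes a simplified message type and a mock responder; `std_msg_t`, `mock_fc_unmount`, `unmount_fs`, and the numeric codes are hypothetical stand-ins rather than the facility's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a standard FS message; only the fields needed here. */
typedef struct {
    long type; /* message type */
    long err;  /* stands in for the errno field of the message */
    long fsid; /* file system to unmount */
} std_msg_t;

/* Mock responder: on success the message comes back untouched (a null
 * reply); on failure a non-zero error code is written into it. */
static void mock_fc_unmount(std_msg_t *msg, int fs_is_mounted)
{
    assert(msg != NULL);
    if (!fs_is_mounted)
        msg->err = 22; /* arbitrary non-zero code, for illustration only */
    /* success path: the message is deliberately left unchanged */
}

/* Caller side: clear the error field before sending, so that an unchanged
 * reply reads as success. */
static long unmount_fs(long fsid, int fs_is_mounted)
{
    std_msg_t msg;
    msg.type = 3; /* placeholder type code */
    msg.err = 0;  /* must be zeroed by the caller */
    msg.fsid = fsid;
    mock_fc_unmount(&msg, fs_is_mounted);
    return msg.err;
}
```

If the caller skipped the zeroing step, a stale value left in the field would be indistinguishable from an error reported by the responder.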

Data Transfer Messages 

There are four common requests that require the transfer
of data. These are FC_READ, FC_READDIR,
FC_READLINK, and FC_WRITE. The FS facility handles
these requests with a two message protocol. All four transactions
are similar, and all use the FC_RDWR_T message
structure for their messages.



FC_CRED cred; /* 
FC_FH file; ' 



* Structure used in responi 



The FC_READ transaction is described in some detail.
The three others are described by comparison.
A read dat;



As sent by a client facility, the "in" structure of the union
is valid. It specifies a file, an offset and a count. The FS
facility locks the buffers which contain that information; a
series of message transactions with the S facility may be
necessary to read the file from disk. In its reply, the FS
facility uses the "out" structure to return both the attributes
of the file and an array of buffer descriptors that identify the
VME memory locations holding the data. A buffer descriptor
is valid only if its "buf" field is non-zero. The FS facility
uses non-zero values to identify buffers, but to client facilities
they have no meaning. The attributes and buffer descriptors
are valid only if no error has occurred. For a read at the
end of a file, there will be no error, but all buffer descriptors
in the reply will have NULL "buf" fields.

After the client facility has read the data out of the buffers, 
it sends the same message back to the FS facility a second 
time. This time the transaction is: 

fc_release (FC_RDWR_T{un.out.bd}, *{errno});
This fc_release request must use the same message that was
returned by the fc_read request. In the reply to the fc_read,
the FS facility sets the message "type" field of the message
to make this work. The following pseudo-code fragment
illustrates the sequence:

msg=(FC_RDWR_T*)k_alloc_msg( );

initialize_message;

msg=k_send(msg, fc_pid);

copy_data_from_buffers_into_local_memory;

msg=k_send(msg, fc_pid);

The same message, or an exact duplicate, must be returned 
because it contains the information the FS facility needs to 
free the buffers. 
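
The two-message sequence can be exercised against a mock responder, as in the sketch below. Everything here is an assumption made for illustration: the `rdwr_msg_t` layout, the `mock_k_send` function, the type codes, and the buffer count are hypothetical and only mimic the behavior described in the text (lock on read, pre-typed reply, unlock on release).

```c
#include <assert.h>
#include <string.h>

#define RDWR_BUFS 2 /* assumed; the value of FC_RDWR_BUFS is not given */

enum { T_READ = 1, T_READ_RELEASE = 2 }; /* placeholder type codes */

typedef struct {
    int buf;          /* non-zero marks a valid descriptor */
    const char *addr; /* stands in for the VME address of the data */
    int len;          /* bytes available at addr */
} buf_desc_t;

typedef struct {
    int type;
    long err; /* stands in for the errno field */
    buf_desc_t bd[RDWR_BUFS];
} rdwr_msg_t;

static int locked_buffers; /* mock FS state: buffers currently held locked */
static const char file_data[] = "hello";

/* Mock k_send(): plays the FS facility for both messages of the protocol. */
static rdwr_msg_t *mock_k_send(rdwr_msg_t *msg)
{
    if (msg->type == T_READ) {
        memset(msg->bd, 0, sizeof msg->bd);
        msg->bd[0].buf = 1;
        msg->bd[0].addr = file_data;
        msg->bd[0].len = 5;
        locked_buffers++;
        msg->err = 0;
        msg->type = T_READ_RELEASE; /* reply is pre-typed for the second send */
    } else if (msg->type == T_READ_RELEASE) {
        for (int i = 0; i < RDWR_BUFS; i++)
            if (msg->bd[i].buf != 0)
                locked_buffers--;
    }
    return msg;
}

/* Client side: read, copy out of the buffers, then send the same message back. */
static int read_file(char *dst, int cap)
{
    rdwr_msg_t msg;
    int copied = 0;

    memset(&msg, 0, sizeof msg);
    msg.type = T_READ;
    rdwr_msg_t *reply = mock_k_send(&msg);

    if (reply->err == 0) {
        for (int i = 0; i < RDWR_BUFS; i++) {
            const buf_desc_t *d = &reply->bd[i];
            if (d->buf == 0 || d->len > cap - copied)
                continue; /* skip invalid descriptors and data that will not fit */
            memcpy(dst + copied, d->addr, (size_t)d->len);
            copied += d->len;
        }
    }
    mock_k_send(reply); /* release: lets the responder unlock the buffers */
    return copied;
}
```

Because the reply already carries the release type, the client never needs to know which of the four release variants applies.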

Although the transaction summary of Table 4 shows just
one fc_release transaction, there are really four, one for
each type of data transfer: fc_read_release, fc_write_release,
fc_readdir_release and fc_readlink_release.
Since the FS facility sets the "type" field for the second
message, this makes no difference to the client facility.

If the original read transaction returned an error, or if none 
of the buffer descriptors were valid, then the release is 
optional. 

The FC_WRITE transaction is identical to FC_READ,
but the client facility is expected to write to the locations
identified by the buffer descriptors instead of reading from
them.

The FC_READDIR transaction is similar to read and
write, but no file attributes are returned. Also, the specified
offset is really a magic value (also sometimes referred to as
a magic cookie) identifying directory entries instead of an
absolute offset into the file. This matches the meaning of the
offset in the analogous VFS/VOP and NFS versions of
readdir. The contents of the returned buffers are "dirent"
structures, as described in the conventional UNIX "getdents"
system call manual page.

The FC_READLINK transaction is the simplest of the
four communication transactions. It returns no file attributes
and, since links are always read in their entirety, it requires
no offset or count.

In all of these transactions, the requested buffers are
locked during the period between the first request and the
second. Client facilities should send the fc_release message
as soon as possible, because the buffer is held locked until
they do, and holding the lock could slow down other client
facilities when requesting the same block.

In the preferred embodiment of the present invention,
these four transactions imply conventional NFS type permission
checking whenever they are received. Although
conventional VFS/UFS calls do no permission checking, in
NFS and the LFS of the present invention, they do. In
addition, the FS facility messages also support an "owner
can always read" permission that is required for NFS.

LFS File Management Messages 



The communication transaction:

fc_null (K_MSG,***);
does nothing but uses k_reply( ).

The communication transaction:

fc_null_null(K_MSG,*);
also does nothing, but uses the quicker k_null_reply( ).
Both of these are intended mainly as performance tools for
measuring message turnaround time.

The communication transaction:

fc_getattr (FC_STD_T{cred,file,un.mask},FC_FILE_T{errno,vattr});
gets the vnode attributes of the specified file. The mask
specifies which attributes should be returned. A mask of
FC_ATTR_ALL gets them all. The same structure is
always used, but for unrequested values, the fields are
undefined.

The communication transaction: 

fc_setattr (FC_SATTR_T,FC_FILE_T{errno,vattr});
sets the attributes of the specified file. Like fc_getattr,
fc_setattr uses a mask to indicate which values should be
set. In addition, the special bits FC_ATTR_TOUCH_[AMC]TIME
can be set to indicate that the access, modify
or change time of the file should be set to the current time
on the server. This allows a Unix "touch" command to work
even if the times on the client and server are not well
matched.

The communication transaction: 

fc_lookup(FC_DIROP_T{cred,where},FC_FILE_T);
searches a directory for a specified file name, returning the
file and its attributes if it exists. The "where" field of
FC_DIROP_T is an FC_DIROP structure which contains
a file, a name pointer, and a name length. The name pointer
contains the vme address of the name. The name may be up
to 256 characters long, and must be in memory that the FS
facility can read.

The communication transaction:

fc_create(FC_CREATE_T,FC_FILE_T);
creates files. The FC_CREATE_T describes what type of
file to create and where. The vtype field may be used to
specify any file type including directories, so mkdir is not
supported. If the "FC_CREATE_EXCL" bit is set in the flag
field, then fc_create will return an error if the file already
exists. Otherwise, the old file will be removed before
creating the new one.
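
The effect of the exclusive bit can be summarized as a small decision function. This is a sketch of the rule stated above, not the facility's code: the flag value 0x0001 follows the FC_CREATE_T listing later in this section, while `create_outcome_t` and `create_outcome` are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define FC_CREATE_EXCL 0x0001 /* value as given in the FC_CREATE_T listing */

/* Hypothetical outcome codes for illustration. */
typedef enum {
    CREATE_NEW,          /* no file existed; a new one is created */
    CREATE_REPLACED_OLD, /* old file removed, new one created */
    CREATE_ERR_EXISTS    /* exclusive create refused */
} create_outcome_t;

/* Decide what a create request does, given whether the name already
 * exists and the flag field of the request. */
static create_outcome_t create_outcome(int file_exists, long flag)
{
    if (!file_exists)
        return CREATE_NEW;
    if (flag & FC_CREATE_EXCL)
        return CREATE_ERR_EXISTS;
    return CREATE_REPLACED_OLD;
}
```

The flag only matters when the name already exists; on a fresh name both modes behave the same.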



The communication transaction:

fc_remove (FC_DIROP_T{cred,where},*{errno});
removes the specified name from the specified directory.
The communication transaction: 
fc_rename (FC_RENAME_T,*); 
changes a file from one name in one directory to a different 
name in a (possibly) different directory in the same file 
system. 

The communication transaction: 

fc_link (FC_LINK_T,*{errno});
links the specified file to a new name in a (possibly) new
directory.

The communication transaction:

fc_symlink (FC_SYMLINK_T,*{errno});
creates the specified symlink.

The communication transaction:

fc_rmdir (FC_DIROP_T{cred,where},*{errno});
removes a directory. The arguments for fc_rmdir are like
those for fc_remove.

The communication transaction:

fc_statfs (FC_STATFS_T{fsid},***);
returns file system statistics for the file system containing the 

VFS/VOP LFS Support Transactions

The communication transactions described below are
provided to support the VFS/VOP subroutine call interface
to the LFS client layer. Most VOP calls can be provided for
using the messages already defined above. The remaining
VOP function call support is provided by the following



'}); ' 

■ file system, 



#define FC_READ
#define FC_WRITE
#define FC_READDIR
#define FC_READLINK
#define FC_READ_RELEASE
#define FC_WRITE_RELEASE
#define FC_READDIR_RELEASE
#define FC_READLINK_RELEASE
#define FC_NULL
#define FC_NULL_NULL
#define FC_GETATTR
#define FC_SETATTR
#define FC_LOOKUP
#define FC_CREATE
#define FC_REMOVE
#define FC_RENAME
#define FC_LINK
#define FC_SYMLINK
#define FC_RMDIR
#define FC_STATFS
#define FC_FSYNC

#define FC_SYNCFS
/* Internal Messages. */
#define FC_REG_PARTITION
#define FC_UNREG_PARTITION



( 6 | FC_ID )
( 7 | FC_ID )
( 8 | FC_ID )
( 9 | FC_ID )
( 10 | FC_ID )
( 11 | FC_ID )
( 12 | FC_ID )
( 13 I FC_ID ) 
• 1 I FC_ID ) 
; 1 FC_ID ) 



( 24 ] FC_ID ) 
( 25 I FC_ID ) 
( 26 I FC_ID ) 



The FS facility message structures are listed below. 



The communication transactions:

fc_fsync (FC_STD_T{cred,file},*{errno});

fc_syncfs (FC_STD_T{cred,fsid},*{errno});
ensure that all blocks for the referenced file or file system,
respectively, are flushed.

The communication transaction:

fc_access (FC_STD_T{cred,file,mode},*{errno});
determines whether a given type of file access is legal for
specified credentials ("cred") on the specified file. The mode
value is "FC_READ_MODE", "FC_WRITE_MODE",
or "FC_EXEC_MODE". If the mode is legal, the returned

Table 5 lists the inter-facility message types supported by 
the FS facility. 



K_MSGTYPE type;
FC_CRED cred;

/* Access credentials */

FC_FH file;

FC_FSID fsid;

/* For fc_get_server. */
/* {READ,WRITE,EXEC}

K_PID pid;

/* FS facility pid of

/* Mask attributes.
(FC_ATTR_*). */



/* IN: Which FC to 



10}); 
s is legal for 



#define FC_ID ( (long)( ('F'<<8) |

#define FC_FIND_MANAGER ( 1 | FC_ID )
#define FC_MOUNT ( 2 | FC_ID )
#define FC_UNMOUNT ( 3 | FC_ID )



FC_PARTITION partition;

K_PID fc_pid;

FC_FH file;

} FC_MOUNT_T;
typedef struct {

K_MSGTYPE type;

FC_CRED cred;

FC_FH file;

long mask; /* ;

(FC

FC_SATTR sattr;
} FC_SATTR_T;
typedef struct {

K_MSGTYPE type;
} FC_BUF_DESC;



The FC_BUF_DESC structure is used in the two-message
data transfer protocols. A typical sequence is:



Note that the "out" union member is the output for the first
message and the input for the second.



devices 

} FC_CREATE_T;
/* Values for the flag. */

#define FC_CREATE_EXCL 0x0001 /* Exclusive
typedef struct {

K_MSGTYPE type;

long errno;

FC_CRED cred;

FC_FH from;

FC_DIROP to;
} FC_RENAME_T;



FC_CRED 
FC_FH 
FC_DIROP 
} FC_LINK_T;
typedef struct {

K_MSGTYPE



FC_CRED cred; 



/* User requested 
file offset. */ 
/* User requested 



FC_BUF_DESC bd[FC_RDWR_BUFS];



vattr; /* For 

new offset; /* For READDIR. 



} FC_SYMLINK_T;
typedef struct {
K_MSGTYPE

FC_FSID

u_long

u_long

u_long



This structure is used in those operations that take a 
directory file handle and a file name within that directory, 
namely "lookup", "remove", and "rmdir". 



} FC_STATFS_T;
#define FC_MAXNAMLEN
#define FC_MAXPATHLEN



1024 



K_MSGTYPE 



long d_off; /* offset of next disk directory entry */
u_long d_fileno; /* file number of entry */
u_short d_reclen; /* length of this record */
u_short d_namlen; /* length of string in d_name */
char d_name[FC_MAXNAMLEN + 1]; /* name (up to MAXNAMLEN + 1) */



Not all fields that can be set can be specified in a create,
so instead of including FC_SATTR, only the values that can
be set are included.



b. NC Facility Communication Transactions 
The communication transactions that the NC facilities of
the present invention recognize, and that the other messaging
kernel layers of the present invention
recognize as appropriate to interact with the NC
facility, are summarized in Table 6 below. The NC facility



nc_set_ifmetric
nc_set_ifaddr
nc_get_ifaddr
nc_get_ifstat



( NC_IFIOCTL_T{unit,mc_addr},
( NC_IFIOCTL_T{unit,flags},

( NC_IFIOCTL_T{unit},

***{status,flags} )
( NC_IFIOCTL_T{unit,metric},

***{status} )
( NC_IFIOCTL_T{unit,if_addr},

***{status} )
( NC_IFIOCTL_T{unit},

***{status,if_addr} )
( NC_IFSTATS_T,*** )
( NC_IFIOCTL_T{unit,flags},

( NC_IFIOCTL_T{unit},



nc_get_ip_netmask

nc_add_arp_entry

nc_del_arp_entry

nc_get_arp_entry

nc_add_route

nc_del_route

NFS Configuration Me 



_T, — ) 
•*) 



nc_nfs_export
nc_nfs_unexport

Network Interface D



NC_INIOCTL_T,
NC_ARPIOCTL_T, *** )
NC_ARPIOCTL_T, *** )
NC_ARPIOCTL_T, *** )
NC_RTIOCTL_T, *** )
NC_RTIOCTL_T, *** )



NC_NFS_START_T, * )

NC_NFS_EXPORT_T, ***{errno} )

NC_NFS_UNEXPORT_T, ***{errno} )



nc_xmit_pkt ( NC_PKT_IO_T,* )

nc_recv_dl_pkt ( NC_PKT_IO_T,* )

nc_recv_ip_pkt ( NC_PKT_IO_T,* )

nc_recv_promis_pkt ( NC_PKT_IO_T,* )
nc_forward_ip_pkt ( NC_PKT_IO_T,* )
Secure Authentication Messages 



ks_decrypt ( KS_DECRYPT_T{netname,netnamelen,desblock},
***{rpcstatus,ksstatus,desblock} )

ks_getcred ( KS_GETCRED_T{netname,netnamelen},
***{rpcstatus,ksstatus,cred} )



A network communications facility can exchange messages
with the host facility, file system facility and any other
network communications facility within the system 160. The
host facility will exchange messages with the network
communications facility for configuring the network
interfaces, managing the ARP table and IP routing table, and
sending or receiving network packets. In addition, the host
facility will exchange messages with the network communications
facility for configuring the NFS server stack and to
respond in support of a secure authentication service request.
The network communications facility will exchange messages
with the file system facility for file service using the
external FS communication transactions discussed above.
Finally, a network communication facility will exchange
messages with other network communication facilities for IP
packet routing.

System Call Layer Changes

The exportfs( ), unexport( ), rtrequest( ), arpioctl( ) and
in_control( ) function calls in the system call layer have
been modified. The exportfs( ) and unexport( ) functions are
called to export new file systems and unexport an exported
file system, respectively. A call to these modified functions
now also initiates the appropriate NC_NFS_EXPORT or
NC_NFS_UNEXPORT communication transactions to
each of the network facilities.

The rtrequest( ) function is called to modify the kernel
routing table. A call to the modified function now also
initiates an appropriate NC communication transaction
(NC_ADD_ROUTE for adding a new route or NC_DEL_ROUTE
for deleting an existing route) to each of the
network facilities.

The arpioctl( ) function is called to modify the kernel ARP
table. This function has now been modified to also initiate
the appropriate NC communication transaction (NC_ADD_ARP
for adding a new ARP entry or NC_DEL_ARP
for deleting an existing entry) to each of the network facilities.

Finally, the in_control( ) function is called to configure
the Internet Protocol parameters, such as setting the IP
broadcast address and IP network mask to be used for a
given interface. This function has been modified to also initiate
the appropriate NC communications transaction (NC_SET_IP_BRADDR
or NC_SET_IP_NETMASK) to the
appropriate network facility.
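
The pattern shared by these modified calls, a local update followed by a fan-out to the network facilities, can be sketched as follows. The sketch assumes mock state in place of real kernel tables; `rtrequest_sketch`, `mock_send_to_nc`, and the message codes are hypothetical names, and the limit of four facilities simply mirrors the processor numbering 0 to 3 used elsewhere in this section.

```c
#include <assert.h>

#define MAX_NC 4 /* network processors are numbered 0 to 3 in the text */

typedef enum { MSG_ADD_ROUTE, MSG_DEL_ROUTE } route_msg_t; /* placeholder codes */

static int host_routes;       /* mock size of the host kernel routing table */
static int nc_routes[MAX_NC]; /* mock size of each NC facility's routing table */

/* Mock transport: applies the route change on the receiving NC facility. */
static void mock_send_to_nc(int nc, route_msg_t msg)
{
    nc_routes[nc] += (msg == MSG_ADD_ROUTE) ? 1 : -1;
}

/* Sketch of a modified routing call: the original local update is kept,
 * and the same change is then propagated to every active NC facility. */
static void rtrequest_sketch(int add, int active_nc)
{
    assert(active_nc >= 0 && active_nc <= MAX_NC);
    host_routes += add ? 1 : -1; /* original behaviour */
    for (int nc = 0; nc < active_nc; nc++) /* added behaviour */
        mock_send_to_nc(nc, add ? MSG_ADD_ROUTE : MSG_DEL_ROUTE);
}
```

Keeping the local update first means the host table stays authoritative even if a facility is slow to apply the change.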
NC Facility Initialization 

When a network communications facility is initialized 
following bootup, the following manager processes are 
created: 



NFS server process for processing
NFS_EXPORT and NFS_UNEXPORT

Network interface control process for
processing IOCTL communication
transactions from the host; and
Network transmit process for processing
NC_XMIT_PKT and NC_FWD_IP_PKT



<n> is the network processor number: 0,1,2, or 3. 
<i> is the network interface (LAN) number: 0,1,2,3,4,5,6, 
or 7. 

Once initialized, the NC facilities report the "names" of
these processes to a SC_NAME_SERVER manager
process, having a known default PID, started and running in
the background, of the host facility. Once identified, the host
facility can configure the network interfaces (each LAN
connection is seen as a logical and physical network
interface). The following command is typically issued by the
Unix start-up script for each network interface:

ifconfig <interface name> <host name> <options> up

<interface name> is the logical name being used for the
interface;

<host name> is the logical host name of the referenced
<interface name>.
The ifconfig utility program ultimately results in two IOCTL
commands being issued to the network processor:



nc_set_ifflags( flags = UP + <options> );
nc_set_ifaddr( ifaddr = address_of_host-name(<host name>) );

The mapping of <host name> to address is typically
specified in the "/etc/hosts" file. To start the NFS service, the
following commands are typically then issued by the Unix



nfsd <n> 
exportfs -a 

<n> specifies the number of parallel NFS server processes to
be started.
The nfsd utility program initiates an "nc_nfs_start" communication
transaction with all network communication
facilities. The "exportfs" communication transaction is used
to pass the list of file systems (specified in /etc/exports) to be
exported by the NFS server using the "nc_nfs_export" communication
transaction.

Once the NFS service is initialized, incoming network
packets addressed to the "NFS server UDP port" will be
delivered to the NFS server of the network communications
facility. It will in turn issue the necessary FS communication
transactions to obtain file service. If the secure authentication
option is used, the NFS server will issue requests to the
Authentication server daemon running on the host processor.
The conventional authentication services include: mapping
(ks_getcred( )) a given <network name> to a Unix style
credential, decrypting (ks_decrypt( )) a DES key using the
public key associated with the <network name> and the
secret key associated with user ID 0 (i.e., with the <network
name> of the local host).

Routing

Once a network communication facility is initialized
properly, the IP layer of the network communication facility
will perform the appropriate IP packet routing based on the
local routing database table. This routing table is managed
by the host facility using the "nc_add_route" and "nc_del_route"
IOCTL commands. Once a route has been
determined for a particular packet, the packet is dispatched
to the appropriate network interface. If a packet is destined
to the other network interface on the same network communication
facility, it is processed locally. If a packet is
destined to a network interface of another network communication
facility, the packet is forwarded using the
"nc_forward_ip_pkt( )" communication transaction. If a packet
is destined to a conventional network interface attached to
the host facility, it is forwarded to the host facility using the
"nc_forward_ip_pkt( )" communication transaction.
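
The three dispatch cases in the routing description can be written as a small decision function. This is a sketch only; the enumerations and the `route_action` and `uses_forward_ip_pkt` helpers are hypothetical names introduced for illustration, and the route lookup itself is assumed to have already classified the destination.

```c
#include <assert.h>

/* Where the chosen route's outgoing interface lives (hypothetical classification). */
typedef enum { DEST_SAME_NC_IF, DEST_OTHER_NC_IF, DEST_HOST_IF } dest_kind_t;

/* What the IP layer of the NC facility does with the packet. */
typedef enum { HANDLE_LOCALLY, FORWARD_TO_NC, FORWARD_TO_HOST } route_action_t;

static route_action_t route_action(dest_kind_t dest)
{
    switch (dest) {
    case DEST_SAME_NC_IF:
        return HANDLE_LOCALLY;  /* other interface on the same NC facility */
    case DEST_OTHER_NC_IF:
        return FORWARD_TO_NC;   /* sent with nc_forward_ip_pkt */
    case DEST_HOST_IF:
        return FORWARD_TO_HOST; /* also sent with nc_forward_ip_pkt */
    }
    return HANDLE_LOCALLY; /* not reached for valid input */
}

/* Both forwarding cases use the same communication transaction. */
static int uses_forward_ip_pkt(route_action_t action)
{
    return action != HANDLE_LOCALLY;
}
```

Only the local case avoids the messaging kernel; both remote cases differ in target but not in transaction type.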

The host facility provides the basic network front-end
service for system 160. All packets that are addressed to the
system 160, but are not addressed to the NFS stack UDP
server port, are forwarded to the host facility's receive
manager process using the following communication transactions:

nc_recv_promis_pkt (NC_PKT_IO_T,*); to the

appropriate network communication facility.
Table 7 lists the inter-facility message types supported by
the NC facility.

TABLE 7 



do) (type & 0xfffffff0)

#define MAC_IOCTL_CMDS ((1 << 4) + NC_ID)
#define NC_REGISTER_DL (MAC_IOCTL_CMDS+0)
#define NC_SET_MACFLAGS (MAC_IOCTL_CMDS+1)
#define NC_GET_MACFLAGS (MAC_IOCTL_CMDS+2)
#define NC_GET_IFSTATS (MAC_IOCTL_CMDS+3)
/* BSD "if" ioctl's */
#define DL_IOCTL_CMDS ((2 << 4) + NC_ID)
#define NC_SET_PROMSIC (DL_IOCTL_CMDS+0)
#define NC_ADD_MULTI (DL_IOCTL_CMDS+1)
#define NC_DEL_MULTI (DL_IOCTL_CMDS+2)
#define NC_SET_IFFLAGS (DL_IOCTL_CMDS+3)
#define NC_GET_IFFLAGS (DL_IOCTL_CMDS+4)
#define NC_SET_IFMETRIC (DL_IOCTL_CMDS+5)
#define NC_SET_IFADDR (DL_IOCTL_CMDS+6)
#define NC_GET_IFADDR (DL_IOCTL_CMDS+7)
/* BSD "in" ioctl's */
#define IN_IOCTL_CMDS ((3 << 4) + NC_ID)
#define NC_SET_IP_BRADDR (IN_IOCTL_CMDS+0)
#define NC_SET_IP_NETMASK (IN_IOCTL_CMDS+1)
#define NC_GET_IP_BRADDR (IN_IOCTL_CMDS+2)
#define NC_GET_IP_NETMASK (IN_IOCTL_CMDS+3)
/* BSD "arp" ioctl's */
#define ARP_IOCTL_CMDS ((4 << 4) + NC_ID)
#define NC_ADD_ARP (ARP_IOCTL_CMDS+0)
#define NC_DEL_ARP (ARP_IOCTL_CMDS+1)
#define NC_GET_ARP (ARP_IOCTL_CMDS+2)
/* BSD "route" ioctl's */
#define RT_IOCTL_CMDS ((5 << 4) + NC_ID)
#define NC_ADD_ROUTE (RT_IOCTL_CMDS+0)
#define NC_DEL_ROUTE (RT_IOCTL_CMDS+1)
/* Host/NC to NC data communication */
#define NC_DLXMIT_MSGTYPES ((6 << 4) + NC_ID)
#define NC_XMIT_PKT (NC_DLXMIT_MSGTYPES+0)
#define NC_FWD_IP_PKT (NC_DLXMIT_MSGTYPES+1)


#define NC_RECV_PROMIS_PKT
#define NC_RECV_IP_PKT



nc_recv_dl_pkt ( NC_PKT_IO_T,* );

where the packet type is not IP; and
nc_recv_ip_pkt ( NC_PKT_IO_T,* );



The communication transaction:
nc_recv_promis_pkt (NC_PKT_IO_T,*);
transfers packets not addressed to system 160 to the host
facility when a network communication facility has been
configured to receive in promiscuous mode by the host
facility.

To transmit a packet, the host facifity initiates a commu- 



((7 << 4) + NC_ID)
(NC_DLRECV_MSGTYPES+0)
(NC_DLRECV_MSGTYPES+1)
(NC_DLRECV_MSGTYPES+2)

/*NFSs
#define NFS_CMDS ((8 << 4) + NC_ID)
#define NC_NFS_START (NFS_CMDS+0)
#define NC_NFS_EXPORT (NFS_CMDS+1)
#define NC_NFS_UNEXPORT (NFS_CMDS+2)
#define NC_NFS_GETSTAT (NFS_CMDS+3)
#define NC_NFS_STOP (NFS_CMDS+4)
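
The numbering scheme in Table 7 places each command family in its own block of sixteen codes, so masking off the low four bits of a message type recovers the family base. The sketch below illustrates this under explicit assumptions: the value of NC_ID is not shown in the listing, so `NC_ID_ASSUMED` is a hypothetical placeholder whose low four bits are zero, and `GROUP` and `msg_group` are illustrative helpers, not names from the specification.

```c
#include <assert.h>

/* Hypothetical placeholder: the real NC_ID is not given. The scheme only
 * works as sketched if its low four bits are zero. */
#define NC_ID_ASSUMED 0x4E00L

#define GROUP(n) (((long)(n) << 4) + NC_ID_ASSUMED)

#define RT_IOCTL_CMDS GROUP(5)
#define NC_ADD_ROUTE  (RT_IOCTL_CMDS + 0)
#define NC_DEL_ROUTE  (RT_IOCTL_CMDS + 1)

#define NFS_CMDS      GROUP(8)
#define NC_NFS_EXPORT (NFS_CMDS + 1)

/* Recover the command family of a message type by clearing the low four
 * bits, in the manner of the (type & 0xfffffff0) fragment in the table. */
static long msg_group(long type)
{
    return type & 0xfffffff0L;
}
```

A receiving process could use such a mask to route a message to the manager for its family before examining the specific command.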



The NC facility message structures are listed below. 



nc_xmit_pkt (NC_PKT_IO_T,*);
to the appropriate network communication facility.

Finally, the host facility may monitor the messages being
handled by a network communication facility by issuing the

* exported vfs flags.
*/
#define EXMAXADDRS 10
} NC_EXADDRLIST;
* Associated with AUTH_UNIX is




int rsxdrcall;

int ncalls; /* Out - total NFS calls */
int nbadcalls; /* - calls that failed */
int reqs[32]; /* - calls for each request */
} NC_NFS_STATS_T;



* Netv 



■k Interface lOCTL co 



#define EXMAXROOTADDRS 10 
typedef struct { 

NC_EXADDRLIST rootaddrs; 
} NC_UNIXEXPORT;



th AUTH_DES is a list of nctvsrork na 



I* Only used with IF, MAC and 



} NC_DESEXPORT;
typedef struct {



} NC_REGISTER_DL_T;
typedef struct {

K_MSGTYPE m_type;



#define MAXFIDSZ 16 



/' I: s. 



long metric; /* I */

struct sockaddr if_addr; /* I */

} NC_IFIOCTL_T;
typedef struct {

K_MSGTYPE m_type;



I' Only used with IF, MAC and 



K_MSGTYPE I 
long ( 
fsid_t I 

stmct fid fid; ; 



;D for directory being 



long if_collisions;
} if_stats;
} NC_IFSTATS_T;
typedef struct {

K_MSGTYPE m_type;



NC_EXADDRLIST writeaddrs;
} NC_NFS_EXPORT_T;
typedef struct {

K_MSGTYPE m_type;



fsid_t

struct fid fid;
} NC_NFS_UNEXPORT_T;



/* of directory being
unexported */
/* FID for directory being
unexported */



I rsbadcalls; /• Out - 1 



/* Out - total RPC ca 



} NC_INIOCTL_T;
typedef struct {

K_MSGTYPE m_type;



} NC_ARPIOCTL_T;
typedef struct {

K_MSGTYPE m_



■A with IF, MAC and 
Network Interface Data Communication



TABLE 8-continued 

Host Facility Message Types 

( SC_RESOLVE_FIFO_T,"
( SC_TIME_REGISTER_T,'
( SC_REAL_TIME_T,*" );
( SC_ERR_LOG_MSG_T,*
( SC_ERR_LOG_MSG2,***



} PKT_DATA_BUFFER;

#define MAX_DL_BUFFRAG 4

#define VME_XFER_MODE_NORMAL 0

#define VME_XFER_BLOCK 1

#define VME_XFER_AEP 2 /* enhanced

K_MSGTYPE m_type;

char src_net; /* Source of packet. */

char dst_net; /* Destination of packet. */

char vme_xfer_mode; /* What transfer mode can be

char pad1;

short pktlen; /* Total packet length. */
short pad2;

PKT_DATA_BUFFER pkt_buflist[MAX_DL_BUFFRAG+1];
} NC_PKT_IO_T;



K_MSGTYPE type;
u_long rpcstatus
u_long ksstatus



netnamelen; /* length of netname */



} KS_DECRYPT_T;
typedef struct {

K_MSGTYPE type



} KS_GETCRED_T; 



c. Host Facility Communication Transactions 

The communication transactions that the host facility of
the present invention recognizes and provides are summarized
in Table 8 below. These transactions are used to
support the initialization and ongoing coordinated operation
of the system 160.



Name Service 

The name server daemon ("named") is the Unix host
facility process that boots the system and understands all of
the facility services that are present in the system. That is,
each facility provides at least one service. In order for any
facility to utilize a service of another, the name of that
service must be published by way of registering the name
with the name server daemon. A name is an ascii string that
represents a service. When the name is registered, the
relevant servicing process PID is also provided. Whenever
the name server daemon is thereafter queried to resolve a
service name, the name server daemon will respond with the
relevant process PID if the named service is available. This
one level of indirection relieves the need to otherwise

Rather, the multi-tasking kernels of the messaging kernel
layers are allowed to establish a PID of their own choosing
to each of the named services that they may register.
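
The register-then-resolve behavior described above can be sketched in C as follows. This is only an illustrative sketch: the function names named_register and named_resolve, the table size MAX_SERVICES, and the use of a 32-bit integer for K_PID are assumptions made for the example, while the 32-byte name limit follows the K_MAX_NAME_LEN constant in the listings.

```c
#include <stdint.h>
#include <string.h>

#define K_MAX_NAME_LEN 32  /* maximum name length, as in the listings */
#define MAX_SERVICES   64  /* hypothetical table size */

typedef uint32_t k_pid;    /* assumed stand-in for K_PID */

/* Simple in-memory name table kept by the (hypothetical) name server. */
static struct {
    char  name[K_MAX_NAME_LEN];
    k_pid pid;
} registry[MAX_SERVICES];
static int registry_count;

/* Publish a service name together with the PID that services it.
 * Returns 0 on success, -1 if the name is too long, the PID is zero,
 * or the table is full. Re-registering a name updates its PID. */
int named_register(const char *name, k_pid pid)
{
    int i;

    if (pid == 0 || strlen(name) >= K_MAX_NAME_LEN)
        return -1;
    for (i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].name, name) == 0) {
            registry[i].pid = pid;
            return 0;
        }
    }
    if (registry_count >= MAX_SERVICES)
        return -1;
    strcpy(registry[registry_count].name, name);
    registry[registry_count].pid = pid;
    registry_count++;
    return 0;
}

/* Resolve a service name; zero means the service is not available. */
k_pid named_resolve(const char *name)
{
    int i;

    for (i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].name, name) == 0)
            return registry[i].pid;
    }
    return 0;
}
```

Returning zero for an unknown name mirrors the "0 => not found" convention noted in the SC_RESOLVE_NAME_T listing.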
The communication transaction:

sc_register_fifo (SC_REGISTER_FIFO_T,***);
is directed to the named daemon of the host facility to
provide notice that the issuing NC, FS, or S facility has been
started. This transaction also identifies the name of the
facility, as opposed to the name of a service, of the facility
that is registering, its unique facility ID (VME slot ID) and
the shared memory address of its message descriptor FIFO.
The communication transaction:

sc_get_sys_config (SC_GET_SYS_CONFIG_T,***);
is used by a booting facility to obtain configuration
information about the rest of the system 160 from the name server
daemon. The reply message identifies all facilities that have
been registered with the name server daemon.
The communication transaction:

sc_init_complete (SC_INIT_COMPLETE_T,***);
is sent to the name server daemon upon completion of its
initialization inclusive of handling the reply message to its
sc_get_sys_config transaction. When the name server
daemon returns a reply message, the facility is cleared to begin
normal operation.

TABLE 8

Host Facility Message Types

sc_register_fifo ( SC_REGISTER_FIFO_T,*** );
sc_get_sys_config ( SC_GET_SYS_CONFIG_T,*** );
sc_register_name ( SC_REGISTER_NAME_T,*** );
sc_init_complete ( SC_INIT_COMPLETE_T,*** );
sc_resolve_name ( SC_RESOLVE_NAME_T,*** );

The communication transaction:

sc_register_name (SC_REGISTER_NAME_T,***);
is used to correlate a known name for a service with the
particular PID of a facility that provides the service. The
names of the typical services provided in the preferred
embodiment of the present invention are listed in Table 9.



TABLE 9 



Host Facility Resident 
SC_NAME_SF.RVF.R - the "Name serv 
TABLE 9-continued 



Operates also to collect and distribute 
information as to the configuration, both 
physical (the total number of NCs present in 
the system and the VME slot number of eacl 
and logical (what system ser\'ices are 
available). 

SC_ERRD - the "ERRD" daemon - execute 
host peer-level processor, or primary host 



facility preS' 



if there 



NC DI^ITtf^executes in a respective 
facility (#). Functions as the Data link 



message request. 
NC_SrArMAN# - < 
facility (#). Function; 



SC_TIMED - the "TIMED" daemon - exec 
host peer-level processor, or primary host 
processor if there is more than one host 
fecility present in the system. Returns the 



provides access to keys which authentica 
FS Facility Resident 



to identify the PID of the ui 



facility (#). Functions as a "statistics 
manager" process on the FC facility to 

S Facility Resident 

S_MANAGER# - executes the respective 
facility (#). All low-level disk requests 
for the disk array coupled to the storage 
processor (#) are directed to this manager 
process. Unnamed worker processes are 
allocated, as necessary to actually carry oi 

S_STATMAN# - executes in a respective 



The communication transaction:

sc_resolve_name (SC_RESOLVE_NAME_T,***);
is used by the messaging kernel layer of a facility to
identify the relevant process PID of a service provided
by another facility. The reply message, when returned
by the name server daemon, provides the "resolved"
process ID or zero if the named service is not supported.

The communication transaction:

sc_resolve_fifo (SC_RESOLVE_FIFO_T,***);
is issued by a facility to the name server daemon the first
time the facility needs to communicate with each of the other
facilities. The reply message provided by the name server
daemon identifies the shared memory address of the
message descriptor FIFO that corresponds to the named service.
Time Service
Theti 

The communication transaction:

sc_time_register (SC_TIME_REGISTER_T,***);
is issued by a facility to the timed daemon to determine the
system time and to request periodic time synchronization
messages. The reply message returns the current time.
The communication transaction:

sc_real_time (SC_REAL_TIME_T,***);
is issued by the time server daemon to provide "periodic"
time synchronization messages containing the current time.
These transactions are directed to the requesting process,
based on the "client_pid" in the originally requesting message.

The period of the transactions is a function of a default time
period, typically on the order of several minutes, or
whenever the system time is manually changed.
Error Logger Service



n ("errd") provides a convenient 
iges to the system console for all 



NC_NFS_VP# - I 



(#). C 



a respective NC facility 
NFS for its 



respective NC facility. Accepts messages 
from the host facility for starting and 
stoping NFS and for controUmg the export 
and unexport of selected file systems. 
NC_DLCTRL# - executes in a respective NC 
facility (#). Functions as the Data T.ink 
controller for its NC facility (#). Accepts 



sc_err_log_msg (SC_ERR_LOG_MSG_T,***); 
prints the string that is provided in the send message, while 
the transaction: 

sc_err_log_msg2 (SC_ERR_LOG_MSG2,***); 
provides a message and an "error id" that specifies a print
format specification stored in an "errd message format" file.
This format file may specify the error message format in 
multiple languages. 



id Constants for the SC__NAMED process. 



■d types. 

BT_NONE ( 
BT_UNIX 

BT_PSA : 
BT_FC ; 
BT_\C 
BT_PLESSEY 
BT_TRACE_ANAL ( 



Host 



BT_MEM 



/* FUe Com 
/• Network Controller. 

,/* Message Trace 

,/* memory board. */ 



#define SC_REAL_TIME 
typedef struct { 

K_MSGTYPE type; 

K_PID client_pid; 



} SC_TIMED_REGISTER_T; 
typedef struct { 

K_MSGTYPE type; 



} SC_REALTIME_T, 

• SC_ERRD: T^es and Structures. 



( 102 1 SC_MSG_GROUP) 



} SLOT_DESC_T; 



* SC_NAMED; Types and st 



6) 



r_GROUP 



#define SC_REGISTER_FIFO
#define SC_RESOLVE_FIFO
#define SC_REGISTER_NAME
#define SC_RESOLVE_NAME
#define SC_DELAY
#define SC_GET_SYS_CONFIG
#define SC_INIT_COMPLETE
#define K_MAX_NAME_LEN



typedef 



ct{ 



(long)( ('S'<<8) | CO )

(1 | SC_MSG_GROUP )
(2 | SC_MSG_GROUP )
(3 | SC_MSG_GROUP )
(4 | SC_MSG_GROUP )
(5 | SC_MSG_GROUP )
(6 | SC_MSG_GROUP )
(7 | SC_MSG_GROUP )
32 /* Maximum process
name length. */



K_MSGTYFE 



M16_FIFO_DESC f 



<.i.sti;r_fifo_T; 



my_j5lot_id; 
sender_j5lot_id; 

name[K_MAX_NAME_LEN]; 



* SC_ERRD message structure 

* Error log usage notes: 

* - Must include "syslog 

* - Priority levels are: 

* LOG_EMERG 
LOG_^ERr 

LOG GRIT 
LOG_[ 



immediately 



LOG_WARNING 
LOG_NOTICE 
LOG_[NFO 
} • LOG_DEBUG 

#define SC_ERR_LOG_MSG
#define SC_ERR_LOG_MSG2
#define ERR_LOG_MSG_LEN (K_MSG_SIZE -
sizeof(K_MSGTYPE)
- sizeof(short))

typedef struct {

K_MSGTYPE type; /* SC_ERR_LOG_MSG */

char msg[ERR_LOG_MSG_LEN]; /* message */

} SC_ERR_LOG_MSG_T;
typedef struct {

K_MSGTYPE type; /* SC_ERR_LOG_MSG */



typedef struct {

K_MSGTYPE type;

short my_slot_id;

short dest_slot_id;

M16_FIFO_DESC fifo_desc; /* 0 => not found */
} SC_RESOLVE_FIFO_T;
typedef struct {

K_MSGTYPE type;

char name[K_MAX_NAME_LEN];
} SC_REGISTER_NAME_T;
typedef struct {

K_MSGTYPE type;

K_PID pid; /* 0 => not found */

char name[K_MAX_NAME_LEN]; /* in */

} SC_RESOLVE_NAME_T;
typedef struct {

K_MSGTYPE type;

SLOT_DESC_T config[M16_MAX_VSLOTS];

} SC_GET_SYS_CONFIG_T;
typedef struct {

K_MSGTYPE type;

short my_slot_id;

} SC_INIT_COMPLETE_T;

* SC_TIMED: types and structures.

#define SC_TIMED_REGISTER ( 101 | SC_MSG_GROUP )



short s[40]; 
' long 1[20]; 

}data; 

} SC_ERR_L0G_MSG2_T, 



d. S Facility Communication Transactions

The communication transactions that the S facilities of the
present invention recognize, and that the other messaging
kernel layers of the present invention recognize as
appropriate to interact with the S facility, are summarized in Table
10 below.



TABLE 10

S Facility Communication Transactions

sp_noop_msg ( SP_MSG,*** );
sp_send_config ( SEND_CONFIG_MSG,*** );
sp_receive_config ( RECEIVE_CONFIG_MSG,*** );
sp_r/w_sector ( SP_RDWR_MSG,*** );
sp_r/w_cache_pg ( SP_RDWR_MSG,*** );
sp_ioctl_req ( SP_IOCTL_MSG,*** );
sp_start_stop_msp ( SP_IOCTL_MSG,*** );

TABLE 10-continued

sp_inquiry_msg ( SP_MSG,*** );
sp_read_message_buffer_msg ( SP_MSG,*** );
sp_set_sp_interrupt_msg ( SP_MSG,*** );



TABLE 11-continued 

S Facility Message Types 



The S facility generally only responds to
transactions initiated by other facilities. However, a few
communication transactions are initiated by the S facility at
boot up as part of the initial system configuration process.

Each S facility message utilizes the same block message
structure of the FS and NC facility messages. The first word
provides a message type identifier. A second word is
generally defined to return a completion status. Together, these
words are defined by a SP_HEADER structure:



char bad_drive; 
char sense_j£ey; 
char sense_code; 
} SP_HEADER; 



The reserved byte will be used by the other facilities to
identify a S facility message. Msg_code and msg_modifier
specify the S facility functions to be performed.
Memory_type specifies the type of VME memory where data transfer
takes place. The S facility uses this byte to determine the
VMEbus protocols to be used for data transfer.
Memory_type is defined as:

03 - Primary Memory, Enhanced Block Transfer

01 - Local Shared Memory, Block transfer

00 - Others, Non-block transfer

The completion status word is used by the S facility to
return message completion status. The status word is not
written by the S facility if a message is completed without
error. One should zero out the completion status of a
message before sending it to the S facility. When a reply is
received, one examines the completion status word to
differentiate a k_reply from a k_null_reply.

The bad_drive value specifies any erroneous disk drive
encountered. The higher order 4 bits specify the drive SCSI
ID (hence, the drive set); the lower order 4 bits specify the
S facility SCSI port number. The sense_key and
sense_code are conventional SCSI error identification data from
the SCSI drive.
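
As an illustration of the bad_drive encoding just described, the following C sketch splits the byte into its two 4-bit fields. The structure and function names are hypothetical and chosen only for this example; only the bit layout is taken from the text.

```c
#include <stdint.h>

/* Hypothetical helper type; only the bit layout comes from the text. */
struct bad_drive_info {
    unsigned scsi_id;   /* higher order 4 bits: drive SCSI ID (drive set) */
    unsigned scsi_port; /* lower order 4 bits: S facility SCSI port number */
};

/* Split the bad_drive byte of the SP_HEADER into its two fields. */
struct bad_drive_info decode_bad_drive(uint8_t bad_drive)
{
    struct bad_drive_info info;

    info.scsi_id   = (bad_drive >> 4) & 0x0Fu;
    info.scsi_port = bad_drive & 0x0Fu;
    return info;
}
```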

The currently defined S facility functions, and identifying : 
le bytes are listed in Table 11. 



TABLE 11

S Facility Message Types

- No Op

- Send Configuration Data

- Receive Configuration Data

- S facility IFC Initialization

- Read and Write Sectors

- Read and Write Cache Pages

- IOCTL Operation



0E - Read Message Log
0F - Set S facility Interrupt



The message completion status word (byte 4-7 of a
message) is defined as:

Byte 00 - completion status

01 - SCSI device ID and S facility SCSI port number

02 - SCSI sense key

03 - SCSI sense code

The completion status byte values are defined below:

00 - Completed without error

01 - Reserved

02 - SCSI Status Error on IOCTL Message

03 - Reserved

04 - An inquired message is waiting to be executed

05 - An inquired message is not found

06 - VME data transfer error

07 - Reserved

08 - Invalid message parameter

09 - Invalid data transfer count or VME data address

0A - S facility configuration data not available

0B - Write protect or drive fault

0C - Drive off-line

0D - Correctable data check

0E - Permanent drive error or SCSI interface error

0F - Unrecovered data check
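
For reference, the completion status byte values listed above may be captured as a C enumeration with a lookup routine. This is a sketch only: the enumerator names and the function sp_status_text are hypothetical, while the numeric values and descriptions follow the list.

```c
/* Completion status byte values from the list above.
 * Enumerator names are hypothetical; values 01, 03 and 07 are reserved. */
enum sp_status {
    SP_ST_OK                  = 0x00,
    SP_ST_SCSI_IOCTL_ERROR    = 0x02,
    SP_ST_INQ_WAITING         = 0x04,
    SP_ST_INQ_NOT_FOUND       = 0x05,
    SP_ST_VME_XFER_ERROR      = 0x06,
    SP_ST_BAD_PARAMETER       = 0x08,
    SP_ST_BAD_COUNT_OR_ADDR   = 0x09,
    SP_ST_NO_CONFIG_DATA      = 0x0A,
    SP_ST_WRITE_PROT_OR_FAULT = 0x0B,
    SP_ST_DRIVE_OFFLINE       = 0x0C,
    SP_ST_CORRECTABLE_CHECK   = 0x0D,
    SP_ST_PERMANENT_ERROR     = 0x0E,
    SP_ST_UNRECOVERED_CHECK   = 0x0F
};

/* Map a completion status byte to its description. */
const char *sp_status_text(unsigned char status)
{
    switch (status) {
    case SP_ST_OK:                  return "Completed without error";
    case SP_ST_SCSI_IOCTL_ERROR:    return "SCSI status error on IOCTL message";
    case SP_ST_INQ_WAITING:         return "Inquired message is waiting to be executed";
    case SP_ST_INQ_NOT_FOUND:       return "Inquired message is not found";
    case SP_ST_VME_XFER_ERROR:      return "VME data transfer error";
    case SP_ST_BAD_PARAMETER:       return "Invalid message parameter";
    case SP_ST_BAD_COUNT_OR_ADDR:   return "Invalid data transfer count or VME data address";
    case SP_ST_NO_CONFIG_DATA:      return "S facility configuration data not available";
    case SP_ST_WRITE_PROT_OR_FAULT: return "Write protect or drive fault";
    case SP_ST_DRIVE_OFFLINE:       return "Drive off-line";
    case SP_ST_CORRECTABLE_CHECK:   return "Correctable data check";
    case SP_ST_PERMANENT_ERROR:     return "Permanent drive error or SCSI interface error";
    case SP_ST_UNRECOVERED_CHECK:   return "Unrecovered data check";
    default:                        return "Reserved or undefined";
    }
}
```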

After receiving a message, the S facility copies the
contents into its memory. After a message's function is
completed, a k_reply or k_null_reply is used to inform the
message sender. K_null_reply is used when the processing
is completed without error; k_reply is used when the
processing is completed with error. When k_reply is used,
a non-zero completion status word is written back to the
original message. Therefore, when a reply is received, a
message sender checks the status word to determine how a
message is completed. When k_null_reply is used, the
original message is not updated. The S facility simply
acknowledges the normal completion of a message.

If a message is not directed to a disk drive, it is executed
immediately. Disk I/O messages are sorted and queued in
disk arm elevator queues. Note, the INQUIRY message
returns either 04 or 05 status and uses k_reply only.
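
The sender-side convention described above (zero the status before sending, then inspect it when the reply arrives) may be sketched as follows. The structure sp_header_sketch is an assumed simplification of SP_HEADER: the order of the four bytes in the first word is not recoverable from the text and is chosen here only for illustration, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed simplification of SP_HEADER. The byte order within the first
 * word is an assumption; bytes 4-7 follow the completion status layout. */
typedef struct {
    uint8_t reserved;     /* identifies an S facility message */
    uint8_t msg_code;
    uint8_t msg_modifier;
    uint8_t memory_type;
    uint8_t status;       /* byte 4: completion status */
    uint8_t bad_drive;    /* byte 5 */
    uint8_t sense_key;    /* byte 6 */
    uint8_t sense_code;   /* byte 7 */
} sp_header_sketch;

/* Prepare a header for sending: everything is cleared, so the
 * completion status word is zero before the message goes out. */
void sp_prepare(sp_header_sketch *h, uint8_t msg_code, uint8_t msg_modifier)
{
    memset(h, 0, sizeof *h);
    h->msg_code = msg_code;
    h->msg_modifier = msg_modifier;
}

/* After a reply: a status that is still zero means the S facility used
 * k_null_reply and left the message untouched; a non-zero status means
 * k_reply wrote a completion status back into the original message. */
int sp_reply_was_null(const sp_header_sketch *h)
{
    return h->status == 0;
}
```

Note that a non-zero status is not always an error under this scheme, since the INQUIRY message reports 04 or 05 through k_reply.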
No Op 

The input parameters for this message are defined as: 

sp_noop_msg (SP_MSG,***); 
The only parameter needed for this message is the message 
header. The purpose for this message is to test the commu- 
nication path between the S facility and a message sender. A 
k_null_reply is always used. 
Send Configuration Data

The input parameters for this operation are defined as:

sp_send_config (SEND_CONFIG_MSG,***);
This message is used to inform the S facility about the
operating parameters. It provides a pointer pointing to a
configuration data structure. The S facility fetches the
configuration data to initialize its local RAM. The configuration
data is also written to a reserved sector on each SCSI disk
such that they can be read back when the S facility is
powered up. Hence, it is not necessary to send this message
each time the S facility is powered up.

In the configuration data structure,
vme_bus_request_level specifies the S facility data transfer request level on the
VME bus. The access_mode specifies if the S facility
should run as independent SCSI drives or as a single logical
drive. In the latter case, number_of_disks should be the same
as number_of_banks because all nine drives in a bank are
grouped into a single logical disk.

Total_sector is the disk capacity of the attached SCSI
disks. Total capacity of a disk bank is this number
multiplied by the number_of_disks. When additional disk banks
are available, they could have sizes different from the first
bank. Hence, total_sector is a three entry array. Stripe_size
is meaningful only when the S facility is running as a
single logical disk storage subsystem. Different stripe sizes
can be used for different drive banks. Finally,
online_drive_bit_map shows the drives that were online at the last
reset. Bit 5 of online_drive_bit_map[1] being set indicates
drive 5 of bank 1 is online. Total_sector and
online_drive_bit_map could not and should not be specified by a user.

The configuration data are written to the disks in a S
facility reserved sector, which is read at every S facility reset
and power up. When the configuration data are changed, one
must reformat the S facility (erase the old file systems).
When this message is completed, a k_reply or
k_null_reply is returned.
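
The two calculations described above, the per-bank capacity and the online drive test, may be sketched in C as follows. The structure config_sketch contains only the fields needed for the example; the field widths and function names are assumptions, since the full config_data listing is not reproduced here.

```c
#include <stdint.h>

#define MAX_BANKS 3  /* total_sector is described as a three entry array */

/* Subset of the configuration data; field widths are assumptions. */
typedef struct {
    uint32_t number_of_disks;
    uint32_t total_sector[MAX_BANKS];         /* per-disk capacity, by bank */
    uint32_t online_drive_bit_map[MAX_BANKS]; /* one bit per drive, by bank */
} config_sketch;

/* Bit n of online_drive_bit_map[bank] set => drive n of that bank online. */
int drive_is_online(const config_sketch *c, unsigned bank, unsigned drive)
{
    if (bank >= MAX_BANKS || drive >= 32)
        return 0;
    return (int)((c->online_drive_bit_map[bank] >> drive) & 1u);
}

/* Total capacity of a bank: total_sector multiplied by number_of_disks. */
uint64_t bank_capacity_sectors(const config_sketch *c, unsigned bank)
{
    if (bank >= MAX_BANKS)
        return 0;
    return (uint64_t)c->total_sector[bank] * c->number_of_disks;
}
```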
Receive Configuration Data

The input parameters for this operation are defined as:

sp_receive_config (RECEIVE_CONFIG_MSG,***);
This message requests the S facility to return configuration
data to a message sender. Vme_pointer specifies a VME
memory location for storing the configuration data. The
same configuration data structure specified in the last section
will be returned.

Read and Write Sectors 

The input parameters for this operation are defined as: 

sp_r/w_sector (SP_RDWR_MSG,***);
Unlike most S facility messages, which are processed
immediately, this message is first sorted and queued. Up to
200 messages can be sent to the S facility at one time. Up
to thirty messages are executed on thirty SCSI drives
simultaneously. The messages are sorted by their sector
addresses. Hence, they are not served by the order of their
arrivals.

There are two possible functions specified by this
message:

msg_mod = 00 - Sector Read
        = 01 - Sector Write

Scsi_id specifies the drive set number. Disk_number
specifies which SCSI port to be used. Sector_count specifies the
number of disk sectors to be transferred. For a sector_read
message, erase_sector_count specifies the number of
sectors in the VME memory to be padded with zeros (each
sector is 512 bytes). For a sector_write message,
erase_sector_count specifies the number of sectors on the disk to
be written with zeros (hence, erased). To prevent sectors
from being erased inadvertently, a sector_write message
can only specify one of the two counters to be non-zero, but
not both. Sector_address specifies the disk sector where
read or write operation starts. Vme_address specifies a
starting VME memory location where data transfer takes
place.

There are three drive elevator queues maintained by the S
facility for each SCSI port (or one for each disk drive). The
messages are inserted in the queue sorted by their sector
addresses, and are executed by their orders in the queue. The
S facility moves back and forth among queue entries like an
elevator. This is done to minimize the disk arm movements.
Separate queues are kept for separate disk drives. These queues are
processed concurrently because the SCSI drive disconnects
from the bus whenever there is no data or command transfer
activities on the bus.
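
One way such an elevator queue could behave is sketched below in C: requests are kept sorted by sector address, served in the current sweep direction, and the direction reverses when no request remains ahead of the arm. This is a minimal sketch under stated assumptions, not the S facility firmware; the type and function names are hypothetical, and the 200-entry limit is borrowed from the message limit given above.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_MAX 200  /* assumed bound, borrowed from the 200 message limit */

typedef struct {
    uint32_t sector[QUEUE_MAX]; /* pending sector addresses, ascending */
    size_t   count;
    uint32_t head;              /* sector address served most recently */
    int      dir;               /* +1 sweeping upward, -1 sweeping downward */
} elevator_queue;

void eq_init(elevator_queue *q)
{
    q->count = 0;
    q->head = 0;
    q->dir = 1;
}

/* Insert a request, keeping the queue sorted by sector address. */
int eq_insert(elevator_queue *q, uint32_t sector)
{
    size_t i;

    if (q->count >= QUEUE_MAX)
        return -1;
    for (i = q->count; i > 0 && q->sector[i - 1] > sector; i--)
        q->sector[i] = q->sector[i - 1];
    q->sector[i] = sector;
    q->count++;
    return 0;
}

/* Remove and return the next request in the current sweep direction,
 * reversing direction when nothing lies ahead. Returns -1 if empty. */
int eq_next(elevator_queue *q, uint32_t *out)
{
    size_t i, pick;

    if (q->count == 0)
        return -1;
    /* i = index of the first entry at or above the head position */
    for (i = 0; i < q->count && q->sector[i] < q->head; i++) {
    }
    if (q->dir > 0) {
        if (i == q->count) {            /* nothing above: turn around */
            q->dir = -1;
            pick = q->count - 1;
        } else {
            pick = i;
        }
    } else {
        if (i < q->count && q->sector[i] == q->head) {
            pick = i;                   /* request at the current position */
        } else if (i == 0) {            /* nothing below: turn around */
            q->dir = 1;
            pick = 0;
        } else {
            pick = i - 1;               /* largest entry below the head */
        }
    }
    *out = q->sector[pick];
    q->head = *out;
    for (i = pick; i + 1 < q->count; i++)
        q->sector[i] = q->sector[i + 1];
    q->count--;
    return 0;
}
```

Because insertion keeps the array sorted, a request that arrives behind the arm waits for the return sweep, which is why messages are not served in order of arrival.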

If no error conditions are detected from the SCSI drive(s),
this message is completed normally. When data check is
found and the S facility is running as a single logical disk,
recovery actions using redundant data are started
automatically. When a drive is down and the S facility is running as
a single logical disk, recovery actions similar to data check
recovery will take place. Other drive errors will be reported
by a corresponding status code value.

K_reply or K_null_reply is used to report the comple- 
tion of this message. 
Read/Write Cache Pages

The input parameters for this operation are defined as:

sp_r/w_cache_pg (SP_RDWR_MSG,***);
This message is similar to Read and Write Sectors, except
multiple vme_addresses are provided for transferring disk
data to and from disk sectors. Each vme_address points to
a memory cache page, whose size is specified by
cache_page_size. When reading, data are scattered to different
cache pages; when writing, data are gathered from different
cache pages (hence, it is referred to as scatter_gather
function).

There are two possible functions specified by this
message:

msg_mod = 00 - Cache Page Read
        = 01 - Cache Page Write

Scsi_id, disk_number, sector_count, and sector_address
are described in Read and Write Sector message. Both
sector_address and sector_count must be divisible by
cache_page_size. Furthermore, sector_count must be less
than 160 (or 10 cache pages). Cache_page_size specifies
the number of sectors for each cache page. Cache pages are
read or written sequentially on the drive(s). Each page has
its own VME memory address. Up to 10 vme_addresses are
specified. Note, the limit of 10 is set due to the size of a S
facility message. Like the sector read/write message, this
message is also inserted in a drive elevator queue first.

If no error conditions are detected from the SCSI drive(s),
this message is completed normally. When an error is
detected, a data recover action is started. When there is a
permanent drive error that prevents error recovery action
from continuing, an error status code is reported as
completion status.

K_reply or K_null_reply is used to report the
completion of this message.
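
The parameter constraints stated above for a cache page message may be checked before sending, as in the following C sketch. The function name and the constants are hypothetical. The text gives the limit both as "less than 160" sectors and as "10 cache pages"; the sketch applies both conditions literally, which is an interpretation rather than a confirmed rule.

```c
#include <stdint.h>

#define MAX_VME_ADDRESSES 10   /* up to 10 vme_addresses per message */
#define SECTOR_COUNT_LIMIT 160 /* sector_count must be less than 160 */

/* Returns the number of cache pages (vme_addresses) a request needs,
 * or -1 if the request violates one of the stated constraints. */
int cache_pages_needed(uint32_t sector_address, uint32_t sector_count,
                       uint32_t cache_page_size)
{
    uint32_t pages;

    if (cache_page_size == 0 || sector_count == 0)
        return -1;
    if (sector_address % cache_page_size != 0)
        return -1;              /* start must fall on a page boundary */
    if (sector_count % cache_page_size != 0)
        return -1;              /* only whole cache pages are transferred */
    if (sector_count >= SECTOR_COUNT_LIMIT)
        return -1;
    pages = sector_count / cache_page_size;
    if (pages > MAX_VME_ADDRESSES)
        return -1;
    return (int)pages;
}
```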
IOCTL Request

The input parameters for this operation are defined as:

sp_ioctl_req (SP_IOCTL_MSG,***);
This message is used to address directly any SCSI disk or
peripheral attached to a SCSI port. Multiple messages can be
sent at the same time. They are served in the order of first
come first serve. No firmware error recovery action is
attempted by the S facility.

Scsi_id, scsi_port, and scsi_lun_address identify
uniquely one attached SCSI peripheral device.
Command_length and data_length specify the lengths of command and
data transfers respectively. Data_buffer_address points to a
VME memory location for data transfer. The command
bytes are actual SCSI command data to be sent to the
addressed SCSI peripheral device. Note, the data length
must be multiples of 4 because the S facility always transfers
4 bytes at a time. Sense_length and sense_addr specify size
and address of a piece of VME memory where device sense
data can be stored in case of check status is received. These
messages are served by the order of their arrivals.

When this message is terminated with drive error, a
corresponding status code is returned. K_reply and
k_null_reply are used to report the completion of this
message.

Start/Stop SCSI Drive 

The input parameters for this operation are defined as:

sp_start_stop_msp (SP_IOCTL_MSG,***);

This message is used to fence off any message to a
specified drive. It should be sent only when there is no
outstanding message on the specified drive. Once a drive is
fenced off, a message directed to the drive will receive a
corresponding error status back.

When the S facility is running as a single logical disk, this
message is used to place a SCSI disk drive in or out of
service. Once a drive is stopped, all operations to this drive
will be fenced off. In such case, when the stopped drive is
accessed, recovery actions are started automatically. When a
drive is restarted, the data on the drive is automatically
reconfigured. The reconfiguration is performed while the
system is online by invoking recovery actions when the
reconfigured drive is accessed.

When a drive is reconfigured, the drive configuration
sector is updated to indicate that the drive is now a part of
a drive set.
Message Inquiry 

The input parameters for this message are defined as: 

sp_inquiry_msg (SP_MSG,***); 
This message requests the S facility to return the status of a
message that was sent earlier. A k_reply is always used. The
status of the message, if available in the S facility buffers, is
returned in the completion status word.

This message is used to verify if a previous message was
received by the S facility. If not, the message is lost. A lost
message should be resent. Message could be lost due to a
local board reset. However, a message should, in general,
not be lost. If messages are lost often, the S facility should
be considered as broken and fenced off.
Read Message Log

The input parameters for this message are defined as:

sp_read_message_buffer_msg (SP_MSG,***);

The S facility keeps a message buffer which contains the last
200 messages. Data_buffer specifies a piece of VME
memory in which the messages are sent.
Number_of_message should not exceed 200. Each message is 128 bytes
long as defined at the beginning of this Section. An
application program must allocate a buffer big enough to
accommodate all returned messages.

Normally this message is sent when there is no active
messages. Otherwise, it is very difficult to determine how
many used messages are in the S facility message buffer. For
example if there are 200 active messages, there will be no
used ones in the message buffer. Where there are less than
requested messages in the message buffer, 128 bytes of zeros
are transmitted for each shortage. K_reply and
k_null_reply are used for the completion of this message.
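
The buffer an application must allocate for this message follows directly from the figures above (128 bytes per message, at most 200 messages). A minimal C sketch, with a hypothetical function name:

```c
#include <stddef.h>

#define SP_MSG_BYTES    128  /* each logged message is 128 bytes long */
#define SP_LOG_MAX_MSGS 200  /* the S facility keeps the last 200 messages */

/* Bytes of VME memory needed to receive number_of_message entries,
 * or 0 if the count is outside the permitted range. Missing entries
 * are returned as 128 bytes of zeros, so the full size is always needed. */
size_t message_log_bytes(unsigned number_of_message)
{
    if (number_of_message == 0 || number_of_message > SP_LOG_MAX_MSGS)
        return 0;
    return (size_t)number_of_message * SP_MSG_BYTES;
}
```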
SP Interrupt 

The input parameters for this message are defined as:

sp_set_sp_interrupt_msg (SP_MSG,***);
This message tells the S facility to pass control to an
on-board debug monitor, as present in the SP boot rom. After
completing this message, the S facility no longer honors any
messages until the monitor returns control. A k_null_reply
is always returned for this message.

The S facility message structures are listed below:



vme_addr; 

msg bodylK MSG SIZE - , 



ct psa msg ''rblink; /* point 



} SP_MSG;
typedef struct {



number_of_banks;



} config_data;
typedef struct {

SP_HEADER header;

config_data *vme_ptr;



} SEND_CONFIG_MSG;
typedef struct {

SP_HEADER header;



/* byte 0-7 */ 
/* byte 8-11 */ 
/* byte 12-15 si 



} SP_RDWR_MSG;

typedef struct {



sector_address;
vme_address[10];



/* byte 10-11 */
/* byte 12-13 */
/* byte 14-15 */
/* byte 16-19 */
/* byte 20-23 */



/* byte 0-7 */
/* byte 8 */
/* byte 9 */
/* byte 10-11 */
/* byte 12-13 */
/* byte 14-15 */
/* byte 16-19 */
/* byte 20-23 */



IV. Start-up Operations 
A. IPC Initialization 
The chart below summarizes the system operations that
occur during system boot.



Phase 1: All peer-lg 



boot Unix image through boot-level S facility; 
start SC_NAME_SERVER process;



IS FIFO for receiving; 
} 

for each ( SP NC FC ) {

level S facility;
download boot image and boot parameters

(including the PID of the SC_NAME_SERVER

process) to the shared memory program

store of the peer-level processor;
start controller;



send SC_REG_FIFO to SC_NAME_SERVER;
send SC_GET_SYS_CONF to SC_NAME_SERVER;
send SC_INIT_CMPL to SC_NAME_SERVER;
}

send SC_REG_NAMEs to SC_NAME_SERVER;
send SC_RESOLVE_NAMEs to SC_NAME_SERVER;
send SC_RESOLVE_FIFOs to SC_NAME_SERVER;



} 



The SP peer-level processors boot from onboard
EPROMs. The SP boot program, in addition to providing for
power-on diagnostics and initialization to a ready state,
includes a complete S facility. Thus, the SP peer-level
processor is able to perform SCSI disk and tape operations
upon entering its ready state. In their ready states, the NC,
FC, SP and H processors can be downloaded with a
complete instantiation of their respective types of facilities. The
downloaded program is loaded into local shared memory;
for the S facility, for example, the program is loaded into its
local 256K static ram. The ram download, particularly to
static ram, allows both faster facility execution and use of
the latest release of the facility software.

After powering up or resetting the SP processor, the host
facility, executing its boot program, waits for the SP boot
program to post ready by indicating a ready state value in an
SP status register.

Once the S boot program has posted ready, a Sector Read
message from the host boot program can be used to retrieve
any disk block to any VME memory location. Generally, the
read request is to load the host facility from disk block 0, the
boot block. In preparing a read_sector message for the S
facility after power up, the local host boot program specifies
the following (in addition to normal read_sector message
contents):

sender_pid=0xffffffff dest_pid=0x00000001

By specifying the above, the local host boot program signals
the S facility to bypass normal IFC reply protocols and to,
in turn, signal a reply complete by directly changing the
0xffffffff message value in the original message image to any
other value, such as the value of the message descriptor. That
is, after building a read sector message, the host boot
program writes a message descriptor to the S facility. The
host boot program can then poll this sender_pid word to
determine when the message is completed. Messages to the
S facility are sent in this manner until the full host facility
boot is complete.
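
The boot-time polling handshake described above may be sketched in C as follows. The structure, constant, and function names are hypothetical, and only the two PID words of the message image are modeled. The marker value 0xffffffff is taken from the sender_pid setting given above; the bounded poll count is an addition for the sketch so that it cannot spin forever.

```c
#include <stdint.h>

#define BOOT_SENDER_PID 0xFFFFFFFFu  /* marker the S facility overwrites */
#define BOOT_DEST_PID   0x00000001u

/* Only the two PID words of the message image are modeled here. */
typedef struct {
    volatile uint32_t sender_pid; /* rewritten by the S facility when done */
    uint32_t          dest_pid;
} boot_msg_sketch;

/* Mark a message as using the boot-time completion convention. */
void boot_msg_prepare(boot_msg_sketch *m)
{
    m->sender_pid = BOOT_SENDER_PID;
    m->dest_pid = BOOT_DEST_PID;
}

/* Poll the sender_pid word up to max_polls times. Returns 1 once the
 * word differs from the marker (message completed), 0 on giving up. */
int boot_msg_wait(const boot_msg_sketch *m, unsigned long max_polls)
{
    while (max_polls-- > 0) {
        if (m->sender_pid != BOOT_SENDER_PID)
            return 1;
    }
    return 0;
}
```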

Once the local host boot program has loaded the host
facility and begun executing its initialization, the host
facility generally switches over to normal IPC communication
with the S facility. To do this, local host facility sends an IFC
Initialization message to the S facility. After receiving this
message, the S facility expects a shared memory block, as
specified by the message, to contain the following
information:

Byte 00-03 - Bootlock, provides synchronization with
the local host facility
Byte 04-05 - S facility board slot id,
Byte 06-07 - Reserved,
Byte 08-09 - This board's IFC virtual slot ID
Byte 10-11 - System controller process number,
Byte 12-27 - System controller fifo descriptor
    Byte 00-01 - System controller fifo type,
    Byte 02-03 - System controller slot id
    Byte 04-07 - Fifo address
    Byte 08-09 - Soft fifo index,
    Byte 10-11 - Soft fifo index mask,
    Byte 12-13 - Interrupt request level,
    Byte 14-15 - Interrupt vector address.
Byte 28-31 - Address of this common memory, and
Byte 32-35 - Size of this common memory.
Byte 36-39 - Hardware fifo address of the S facility
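
The byte layout listed above can be expressed as a pair of C structures whose offsets are verified at compile time. This is a sketch under stated assumptions: the structure and field names are hypothetical, fixed-width integer types are assumed for each field, and byte order on the VME bus is not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* System controller fifo descriptor: bytes 00-15 of the nested list. */
typedef struct {
    uint16_t fifo_type;            /* 00-01 */
    uint16_t slot_id;              /* 02-03 */
    uint32_t fifo_address;         /* 04-07 */
    uint16_t soft_fifo_index;      /* 08-09 */
    uint16_t soft_fifo_index_mask; /* 10-11 */
    uint16_t irq_level;            /* 12-13 */
    uint16_t irq_vector;           /* 14-15 */
} fifo_desc_sketch;

/* Shared memory block expected after the IFC Initialization message. */
typedef struct {
    uint32_t         bootlock;           /* 00-03 */
    uint16_t         board_slot_id;      /* 04-05 */
    uint16_t         reserved;           /* 06-07 */
    uint16_t         virtual_slot_id;    /* 08-09 */
    uint16_t         sc_process_number;  /* 10-11 */
    fifo_desc_sketch sc_fifo;            /* 12-27 */
    uint32_t         common_mem_address; /* 28-31 */
    uint32_t         common_mem_size;    /* 32-35 */
    uint32_t         hw_fifo_address;    /* 36-39 */
} ifc_init_block_sketch;

/* Compile-time checks that the compiler's layout matches the byte list. */
static_assert(sizeof(fifo_desc_sketch) == 16, "fifo descriptor is 16 bytes");
static_assert(offsetof(ifc_init_block_sketch, sc_fifo) == 12, "bytes 12-27");
static_assert(offsetof(ifc_init_block_sketch, common_mem_address) == 28,
              "bytes 28-31");
static_assert(offsetof(ifc_init_block_sketch, hw_fifo_address) == 36,
              "bytes 36-39");
static_assert(sizeof(ifc_init_block_sketch) == 40, "block is 40 bytes");
```

With natural alignment every field already falls on the listed offset, so no packing directive is needed on typical compilers; the static assertions fail the build if that assumption does not hold.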
The first thing the S facility does is check the bootlock 
variable. When it is set to a "BOOTMASTER" value, it 
means the local host facility is up and ready to receive 
message from the S facility. Otherwise, the S facility waits 
for the local host facility to complete its own initialization 
and set the bootlock word. As soon as the bootlock word is 
changed, the S facility proceeds to perform IFC initializa- 
tion. The following IFC messages are sent to the local host 
facility: 

1. Register FIFO 

2. Get System Configuration 

3. Initialization Complete 

4. Register Name 

5. Resolve FIFO 

The second message allows the S facility to know who is
in what VME slots within the system. The S facility will
only register one name, "SPn" (n is either 0 or 1), with a
processor ID of 1. Hence all messages directed to the S
facility specify PID=SP_SLOT<<16+0x0001. Basically, a
processor ID (PID) is a 4-byte word, in which the higher
order two bytes contain the processor's VME slot ID. The
lower order two bytes identify a process within a processor.
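
The PID composition described above may be sketched in C as follows; the function names are hypothetical. The shift is parenthesized explicitly, since the slot ID is meant to occupy the higher order two bytes before the process number is added.

```c
#include <stdint.h>

/* Build a 4-byte PID: higher order two bytes hold the VME slot ID,
 * lower order two bytes identify the process within that processor. */
uint32_t make_pid(uint16_t slot_id, uint16_t process)
{
    return ((uint32_t)slot_id << 16) | process;
}

/* Extract the VME slot ID from a PID. */
uint16_t pid_slot(uint32_t pid)
{
    return (uint16_t)(pid >> 16);
}

/* Extract the process number from a PID. */
uint16_t pid_process(uint32_t pid)
{
    return (uint16_t)(pid & 0xFFFFu);
}
```

For example, an S facility in slot 3 would be addressed with make_pid(3, 0x0001).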

The register FIFO message formally informs the local
host facility about the S facility's fifo address. The get
system configuration message retrieves a table describing all
available processors from the local host facility. After
completing initialization, using the Initialization Complete
message, the S facility advertises its services by issuing the
Register Name message, which informs the host facility that
the S facility service process is up and running. When
another facility sends a message to the S facility for the first
time, the S facility uses a Resolve FIFO message, directed
to the host facility, to obtain the fifo address needed for a
reply.

Thus, a multiple facility operating system architecture
that provides for the control of an efficient, expandable
multi-processor system particularly suited to servicing large
volumes of network file system requests has been described.

Clearly, many modifications and variations of the present
invention are possible in light of the above teachings.
Therefore, it is to be understood that within the scope of the
appended claims, the principles of the present invention may
be realized in embodiments other than as specifically
described herein.

We claim: 

1. A computer system employing a multiple facility oper- 
ating system architecture, said computer system comprising: 

a) a plurality of processor units provided to co-opcralivcly 2" 
execute a predetermined set oi operating system peer- 
level facilities, wherein cacli said processor units is 
associated with a respective dillcrenl one of said oper- 
ating system pccr-lcvcl lacihties and not another of said 
operating system peer level lacilities. and wherein each 25 
of said operating system peer-level facilities constitutes 

a respective separately executed software entity which 
includes a respective distinct set of peer-level facility 
related functions, each said processor unit including: 

i) a processor capable of executing a control program; 30 

ii) a memory store capable of storing said control 
program, said processor being coupled to said 
memory store to obtain access to said control 
program, 35 

said memory store providing for the storage of a first
control program portion that includes a one of said
respective distinct sets of operating system peer-level
facility related functions and that corresponds to a one
of said predetermined operating system peer-level
facilities, and a second control program portion that
provides for the implementation of a multi-tasking
interface function, said multi-tasking interface function
being responsive to control messages for selecting for
execution a one of said peer-level facility related func-
tions of said one of said predetermined operating
system peer-level facilities and responsive to said one
of said predetermined operating system peer-level
facilities for providing control messages to request or in
response to the performance of said predetermined
peer-level facility related functions of another operat-
ing system peer-level facility; and

b) a communications bus that provides for the intercon-
nection of said plurality of processor units, said com-
munications bus transferring said control messages
between the multi-tasking interface functions of said
predetermined set of operating system peer-level facili-
ties.

2. The computer system of claim 1 wherein a first one of
said predetermined set of operating system peer-level facili-
ties includes a network communications facility and a sec-
ond one includes a filesystem facility.

3. The computer system of claim 2 wherein said network
communications facility is coupled to a network to permit
the receipt of network requests, said network communica-
tions facility providing for the identification of a predeter-
mined filesystem type network request, said multi-tasking
interface function of said network communications facility
being responsive to said predetermined filesystem type 
network request to provide a predetermined control message 
to said filesystem facility to request the performance of a 
predetermined filesystem function. 

4. The computer system of claim 3 further comprising a 
data store that provides for the storage of data, said prede- 
termined filesystem type network request directing said 
network communications facility to transfer predetermined 
data with respect to said network, said data store being 
coupled to said network communications facility for storing 
said predetermined data.

5. The computer system of claim 3 or 4 wherein said 
predetermined set of peer-level facilities further includes a 
storage facility and wherein said filesystem facility provides
for the performance of said predetermined filesystem 
function, said multi-tasking interface function of said file- 
system facility being responsive to said filesystem facility to 
provide control messages to said storage facility to request

6. The computer system of claim 5 wherein said prede-
termined storage access function directs said storage facility
to transfer said predetermined data, said data store being
coupled to said storage facility for storing said predeter-
mined data.

7. A computer system implementing a co-operative facil-
ity based operating system architecture, said computer sys-
tem comprising: 

a) a plurality of processors, each being coupled to a 
respective control program store and a respective data 
store, said plurality of processors being interconnected
by a communications bus; and 

b) a multiple facility operating system having a kernel and 
providing for the message based co-operative operation 
of said plurality of processors, said multiple facility 
operating system providing for the operating system 
internal execution of a plurality of operating system 
peer-level facilities by execution of each of said peer- 
level facilities by a respective different one of said 
plurality of processors, each of said peer-level facilities 
constituting a respective software entity executed sepa-
rately from said kernel, wherein each of said plurality
of facilities implements a multi-tasking interface cou- 
pleable between said communications bus and a respec- 
tive and unique peer-level control function set to permit 
message transfer between each of said plurality of 
facilities. 

8. The computer system of claim 7 wherein said plurality
of facilities includes a network facility and a filesystem
facility, wherein said network facility includes a communi-
cations network peer-level control function coupled between
a first multi-tasking interface and a network interface and 
said filesystem facility includes a data storage peer-level 
control function coupled between a second multi-tasking 
interface and a filesystem. 

9. The computer system of claim 8 wherein said network
facility is coupled through said network interface to a
communications network, wherein said network facility is
responsive to a predetermined network filesystem message
received via said network interface to provide a predeter-
mined filesystem message, and wherein said filesystem
facility is responsive to said predetermined filesystem mes-
sage to transfer data with respect to said filesystem.

10. The computer system of claim 9 further comprising a 
common data store, said network facility providing for the 
transfer of data between said network interface and said data 
store, said filesystem facility providing for the transfer of 
data between said data store and said filesystem, said com-
munications network peer-level control function directing a
message to said filesystem peer-level control function iden- 
tifying a predetermined location of data in said data store 
with respect to said predetermined filesystem message.

11. A computer system employing a multiple facility 
operating system to provide for co-operative operation of a 
plurality of processors, 

wherein said operating system includes a kernel and a 
plurality of additional component facilities executed
separately from said kernel, each of said component 
facilities including a facility sub-component, that 
defines the execution operation of a one of said com- 
ponent facilities, coupled to a multi-tasking interface 
sub-component,
wherein said computer system comprises: 

a) a plurality of processors executing said operating 
system, each of said processors including local 
memory for the storage and execution of a respective 
component facility; 

b) a data memory accessible by each of said processors 
for the storage and retrieval of data blocks exchange- 
able between said processors; and 

c) a communications bus coupling said processors and 
said data memory to permit the exchange of control 
messages between said processors and data through 
said data memory, 

and wherein said processors each implement a respective 
different local sub-set of fewer than all of said compo- 
nent facilities that depends through the exchange of 
control messages on the execution of another sub-set of 
said componentized facilities by another of said pro- 
cessors to co-operatively implement said operating 
system. 
12. The computer system of claim 11 wherein control
messages communicate any of a facility sub-component
function request, a facility sub-component function 
response, and a facility sub-component identifier of a 
memory space within said data memory to use in connection 
with said sub-component function request. 

13. The computer system of claim 12 wherein said 
plurality of component facilities includes a network facility
and a filesystem facility, wherein a network facility sub- 
component is executed by a first processor to process 
network requests and data transfers and a filesystem facility 
sub-component is executed by a second processor to process 
filesystem requests and data transfers derivative of said 
network requests and data transfers. 

14. The computer system of claim 1, wherein one of the 
processor units in said plurality of processor units is pro- 
vided further to execute a further operating system peer- 
level facility not in said predetermined set of operating 
system peer-level facilities. 

15. The computer system of claim 7, wherein said mul-
tiple facility operating system provides further for the oper-
ating system internal execution of a further operating system
peer-level facility not in said plurality of operating system 
peer-level facilities, by execution of said further peer level 
facility by one of the processors in said plurality of processors.

16. The computer system of claim 7, wherein said kernel 
is a Unix kernel. 

17. The computer system of claim 11, wherein said kernel 
is a Unix kernel.
X. RELATED PROCEEDINGS APPENDIX